This notebook demonstrates the full lifecycle of machine learning model development on COVID-19 patient cases, predicting which patients will require admission to the ICU; the dataset comes from a Kaggle competition. The notebook consists of the following main sections:

  1. Executive Summary Report
  2. Loading analysis packages
  3. Data Loading
  4. Data Cleaning and Pre-Processing
  5. Data Exploration, Univariate and Bivariate Exploration
  6. Modelling and Model Evaluation

Executive Summary Report¶

Overview.¶

This project's main objective was to develop a machine learning solution for determining whether a patient will be admitted to the intensive care unit. This is done in the hope that ICU resources can be prepared in advance, patient transfers can be planned, frontline doctors can safely discharge low-risk patients, and remote follow-up can be arranged for them.

The project was concerned with COVID-19 cases in Brazil. With more than 16 million confirmed cases and 454,429 confirmed deaths by May 26, 2021, Brazil is one of the nations most impacted by the COVID-19 pandemic (according to the Johns Hopkins Coronavirus Resource Center). It was also one of the nations hardest hit by the first wave: the country recorded its first case on February 26, 2020, and community transmission began on March 20, 2020, leaving Brazil unprepared and unable to respond because of the pressure on hospital capacity, including the lengthy and intense demand for ICU (intensive care unit) beds, staff, personal protective equipment, and medical resources.

Data¶

This dataset, which comprises 1,925 rows and 231 columns, was collected from anonymized data from Hospital Sírio-Libanês, with units in São Paulo and Brasília. PATIENT_VISIT_IDENTIFIER is the dataset's unique identifier and covers 385 patients, each with 5 rows of records. The target is the ICU column, which indicates whether the patient was admitted (1) or not (0), and the WINDOW column indicates the time period of each record. Other notable features in this dataset:

    1. Patient demographic information (03)
    2. Patient previous grouped diseases (09)
    3. Blood results (36)
    4. Vital signs (06)

Missing data is represented as NaN: 223,863 missing cells (50.3%) in total, with no duplicated rows.

It was observed that 195 patients were admitted to the ICU and 190 patients were not.

Data Cleaning and pre-processing.¶

The dataset was transposed to give one row per patient, converting the data to 385 rows and 1,151 columns.

Taking care of missing values¶

We have 223,863 missing cells (50.3%), with some features missing over 89% of their values. Columns with over 50% null values were dropped, and the remaining missing values were imputed with backward fill, as suggested by the dataset providers. This downsized the dataset to 384 rows and 230 columns, as one patient visit identifier had missing values across its entire row. The dataset providers also instructed that data gathered after ICU admission should not be taken into consideration, so patients already admitted in WINDOW 0-2 (the first two hours) were removed as well. Hence, our dataset was reduced further to 352 rows and 50 columns after eliminating these rows and columns.

During exploration the dataset was divided into patient-constant features (features that contain the same value for a single patient across all time points) and time-variant features (features that contain multiple values for the same patient, such as multiple lab test results over time) for better visualization. According to the patient-constant features, more males and patients over the age of 65 were admitted to the ICU, while the time-variant features showed correlation within the DIFF and DIFF_REL clusters.
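The per-patient backward-fill imputation described above can be sketched as follows. This is a minimal illustration on a toy frame, not the notebook's actual code; the column names are hypothetical.

```python
import pandas as pd
import numpy as np

# Toy frame: two patients, two time windows each, with gaps in a lab value.
# Column names here are illustrative, not the real dataset's.
df = pd.DataFrame({
    'PATIENT_VISIT_IDENTIFIER': [0, 0, 1, 1],
    'LAB_RESULT': [np.nan, 2.0, 5.0, np.nan],
})

# Backward-fill within each patient so earlier windows inherit later
# measurements, as the dataset providers suggest; a forward fill follows
# to catch trailing gaps.
df['LAB_RESULT'] = (df.groupby('PATIENT_VISIT_IDENTIFIER')['LAB_RESULT']
                      .transform(lambda s: s.bfill().ffill()))
print(df['LAB_RESULT'].tolist())  # [2.0, 2.0, 5.0, 5.0]
```

Grouping before filling is the important detail: a plain `bfill()` on the whole frame would leak one patient's measurements into another patient's rows.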

Correlation. Checking correlation with a single visualization wasn't helpful because of the size of the data, so the analysis was split in two. First, we looked at how the features relate to each other, excluding the target column, adopting a stacked format to better read the correlations.

Secondly, we looked at how these features relate to ICU.
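Both correlation views can be sketched with pandas on synthetic data; this is an illustration of the "stacked" idea under assumed column names, not the notebook's own code.

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
# Illustrative numeric frame standing in for the reduced feature set.
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=['A', 'B', 'C'])
df['ICU'] = (df['A'] + rng.normal(scale=0.1, size=100) > 0).astype(int)

# Feature-to-feature correlations in "stacked" (long) format, which is
# easier to scan than a large heatmap.
pairs = (df.drop(columns='ICU').corr()
           .stack()
           .rename('corr')
           .reset_index())
pairs = pairs[pairs['level_0'] < pairs['level_1']]  # drop self/duplicate pairs
print(len(pairs))  # 3 unique feature pairs for 3 features

# Feature-to-target correlations, sorted by strength.
target_corr = df.drop(columns='ICU').corrwith(df['ICU']).abs().sort_values(ascending=False)
print(target_corr.index[0])  # 'A' drives the target here by construction
```

For n features the stacked view lists n·(n−1)/2 unique pairs, which can then be sorted or filtered by absolute correlation.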

Feature Encoding. Feature encoding was performed on the AGE_PERCENTIL feature, as it is the remaining column holding object (string) values.
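One way to encode this feature is as an ordered category; this sketch assumes the category list below, where values like '10th', '60th', and 'Above 90th' are visible in the data excerpts and the intermediate deciles are inferred.

```python
import pandas as pd

# AGE_PERCENTIL is an ordered category ('10th' < '20th' < ... < 'Above 90th'),
# so an ordinal code preserves more information than unordered dummies.
# The full category list is assumed from the values visible in the data.
order = ['10th', '20th', '30th', '40th', '50th',
         '60th', '70th', '80th', '90th', 'Above 90th']
s = pd.Series(['60th', '10th', 'Above 90th'])
codes = pd.Categorical(s, categories=order, ordered=True).codes
print(list(codes))  # [5, 0, 9]
```

An alternative is `pd.get_dummies`, which avoids imposing an order but adds ten columns instead of one.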

Methodology(Modelling and Model Evaluation)¶

The dataset was split into train and validation sets, with 90% of the data allocated to training and 10% to validation. Ensemble learning and baseline methods were implemented using 8 modelling algorithms, and some of the results are particularly encouraging, especially given that only about 46% of the target values are positive. A perfect model has an ROC-AUC score of 1, while a model no better than random guessing scores 0.5. Algorithms such as KNN, Decision Tree, and SVM had low ROC-AUC scores, while the results were significantly better with Random Forest. This distinction matters because some of the models were not significantly more accurate than picking patients at random to admit to the intensive care unit. Cross-validation has several advantages over a single train/validation split for evaluating model performance: it provides a more stable estimate because it uses more of the data for both training and evaluation, and it can also be used to tune hyper-parameters, the model-specific settings that cannot be learned from the data.
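The cross-validated ROC-AUC comparison can be sketched as follows. The data here is synthetic (`make_classification`) standing in for the scaled ICU feature matrix, and only two of the eight models are shown.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the (scaled) ICU feature matrix.
X, y = make_classification(n_samples=300, n_features=20, random_state=42)

# Cross-validated ROC-AUC gives a more stable estimate than a single
# train/validation split, which matters on a small dataset like this one.
for model in (RandomForestClassifier(random_state=42), KNeighborsClassifier()):
    scores = cross_validate(model, X, y, cv=5, scoring='roc_auc')
    print(type(model).__name__, round(scores['test_score'].mean(), 3))
```

A score near 0.5 on any fold would flag a model as no better than random guessing, which is the check applied to KNN, Decision Tree, and SVM above.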

Hyper-parameters. Feature selection and hyper-parameter tuning were performed on the Random Forest model to further improve accuracy. We were able to improve the validation accuracy by around 6% just by adjusting the algorithm's settings. Additionally, we have reached a point where our model should be able to identify patients who will need an ICU bed in more than 80% of cases.
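The tuning step can be sketched with `GridSearchCV`; the grid values below are illustrative, as the notebook's actual search space is not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the real search would run on the training split.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# A small, hypothetical grid over two Random Forest settings.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={'n_estimators': [50, 100], 'max_depth': [3, None]},
    scoring='roc_auc',
    cv=3)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

`best_params_` then feeds a final refit on the full training set, and the held-out validation split gives the accuracy gain quoted above.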

Conclusion and limitation¶

We were able to develop prediction models in this notebook for the ICU admission classification problem. The work concentrated on the earliest data available for each patient, producing a model that was reasonably accurate. The model's ability to successfully categorise patients for both target values is one sign that this data processing stage was successful.

We must reiterate our caution that working with small datasets restricts how confident we can be in our findings.

Loading analysis packages¶

In [1]:
import math
import warnings

import pandas as pd
import numpy as np
import pandas_profiling as pdp
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
from natsort import index_natsorted

from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import RFE
from sklearn.decomposition import PCA
from sklearn.utils import resample
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.svm import SVC
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                              VotingClassifier)
from sklearn.metrics import (confusion_matrix, accuracy_score, f1_score,
                             precision_score, recall_score, roc_auc_score)
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

warnings.filterwarnings('ignore')

# Plotly graphic library
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

%matplotlib inline

pd.set_option('display.max_columns', 100)

Data Loading¶

In [2]:
#I converted the dataset to CSV from Excel file

data = pd.read_csv("ICU_Prediction.csv")

The first thing we did was to convert our dataset to CSV, then load our data.

Here is an excerpt of the data description for the competition:

  • Available data includes patient demographic information (03), patient previous grouped diseases (09), blood results (36), and vital signs (06)

  • The WINDOW column signifies the time frame in which the patient was transferred to the ICU, and it has 5 time slots.

  • The ICU column signifies whether or not the patient was admitted to the ICU.

Let's have a first peek at the dataset's first and last rows to confirm all of this.

In [3]:
data.head()
Out[3]:
PATIENT_VISIT_IDENTIFIER AGE_ABOVE65 AGE_PERCENTIL GENDER DISEASE GROUPING 1 DISEASE GROUPING 2 DISEASE GROUPING 3 DISEASE GROUPING 4 DISEASE GROUPING 5 DISEASE GROUPING 6 HTN IMMUNOCOMPROMISED OTHER ALBUMIN_MEDIAN ALBUMIN_MEAN ALBUMIN_MIN ALBUMIN_MAX ALBUMIN_DIFF BE_ARTERIAL_MEDIAN BE_ARTERIAL_MEAN BE_ARTERIAL_MIN BE_ARTERIAL_MAX BE_ARTERIAL_DIFF BE_VENOUS_MEDIAN BE_VENOUS_MEAN BE_VENOUS_MIN BE_VENOUS_MAX BE_VENOUS_DIFF BIC_ARTERIAL_MEDIAN BIC_ARTERIAL_MEAN BIC_ARTERIAL_MIN BIC_ARTERIAL_MAX BIC_ARTERIAL_DIFF BIC_VENOUS_MEDIAN BIC_VENOUS_MEAN BIC_VENOUS_MIN BIC_VENOUS_MAX BIC_VENOUS_DIFF BILLIRUBIN_MEDIAN BILLIRUBIN_MEAN BILLIRUBIN_MIN BILLIRUBIN_MAX BILLIRUBIN_DIFF BLAST_MEDIAN BLAST_MEAN BLAST_MIN BLAST_MAX BLAST_DIFF CALCIUM_MEDIAN CALCIUM_MEAN ... TTPA_MAX TTPA_DIFF UREA_MEDIAN UREA_MEAN UREA_MIN UREA_MAX UREA_DIFF DIMER_MEDIAN DIMER_MEAN DIMER_MIN DIMER_MAX DIMER_DIFF BLOODPRESSURE_DIASTOLIC_MEAN BLOODPRESSURE_SISTOLIC_MEAN HEART_RATE_MEAN RESPIRATORY_RATE_MEAN TEMPERATURE_MEAN OXYGEN_SATURATION_MEAN BLOODPRESSURE_DIASTOLIC_MEDIAN BLOODPRESSURE_SISTOLIC_MEDIAN HEART_RATE_MEDIAN RESPIRATORY_RATE_MEDIAN TEMPERATURE_MEDIAN OXYGEN_SATURATION_MEDIAN BLOODPRESSURE_DIASTOLIC_MIN BLOODPRESSURE_SISTOLIC_MIN HEART_RATE_MIN RESPIRATORY_RATE_MIN TEMPERATURE_MIN OXYGEN_SATURATION_MIN BLOODPRESSURE_DIASTOLIC_MAX BLOODPRESSURE_SISTOLIC_MAX HEART_RATE_MAX RESPIRATORY_RATE_MAX TEMPERATURE_MAX OXYGEN_SATURATION_MAX BLOODPRESSURE_DIASTOLIC_DIFF BLOODPRESSURE_SISTOLIC_DIFF HEART_RATE_DIFF RESPIRATORY_RATE_DIFF TEMPERATURE_DIFF OXYGEN_SATURATION_DIFF BLOODPRESSURE_DIASTOLIC_DIFF_REL BLOODPRESSURE_SISTOLIC_DIFF_REL HEART_RATE_DIFF_REL RESPIRATORY_RATE_DIFF_REL TEMPERATURE_DIFF_REL OXYGEN_SATURATION_DIFF_REL WINDOW ICU
0 0 1 60th 0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.086420 -0.230769 -0.283019 -0.593220 -0.285714 0.736842 0.086420 -0.230769 -0.283019 -0.586207 -0.285714 0.736842 0.237113 0.0000 -0.162393 -0.500000 0.208791 0.898990 -0.247863 -0.459459 -0.432836 -0.636364 -0.420290 0.736842 -1.00000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 0-2 0
1 0 1 60th 0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.333333 -0.230769 -0.132075 -0.593220 0.535714 0.578947 0.333333 -0.230769 -0.132075 -0.586207 0.535714 0.578947 0.443299 0.0000 -0.025641 -0.500000 0.714286 0.838384 -0.076923 -0.459459 -0.313433 -0.636364 0.246377 0.578947 -1.00000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 2-4 0
2 0 1 60th 0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.605263 0.605263 0.605263 0.605263 -1.0 -1.000000 -1.000000 -1.000000 -1.000000 -1.0 -1.000000 -1.000000 -1.000000 -1.000000 -1.0 -0.317073 -0.317073 -0.317073 -0.317073 -1.0 -0.317073 -0.317073 -0.317073 -0.317073 -1.0 -0.938950 -0.938950 -0.938950 -0.938950 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 0.183673 0.183673 ... -0.825613 -1.0 -0.836145 -0.836145 -0.836145 -0.836145 -1.0 -0.994912 -0.994912 -0.994912 -0.994912 -1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 4-6 0
3 0 1 60th 0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN -0.107143 0.736842 NaN NaN NaN NaN -0.107143 0.736842 NaN NaN NaN NaN 0.318681 0.898990 NaN NaN NaN NaN -0.275362 0.736842 NaN NaN NaN NaN -1.000000 -1.000000 NaN NaN NaN NaN -1.000000 -1.000000 6-12 0
4 0 1 60th 0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.000000 0.000000 0.000000 0.000000 -1.0 -0.871658 -0.871658 -0.871658 -0.871658 -1.0 -0.863874 -0.863874 -0.863874 -0.863874 -1.0 -0.317073 -0.317073 -0.317073 -0.317073 -1.0 -0.414634 -0.414634 -0.414634 -0.414634 -1.0 -0.979069 -0.979069 -0.979069 -0.979069 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 0.326531 0.326531 ... -0.846633 -1.0 -0.836145 -0.836145 -0.836145 -0.836145 -1.0 -0.996762 -0.996762 -0.996762 -0.996762 -1.0 -0.243021 -0.338537 -0.213031 -0.317859 0.033779 0.665932 -0.283951 -0.376923 -0.188679 -0.379310 0.035714 0.631579 -0.340206 -0.4875 -0.572650 -0.857143 0.098901 0.797980 -0.076923 0.286486 0.298507 0.272727 0.362319 0.947368 -0.33913 0.325153 0.114504 0.176471 -0.238095 -0.818182 -0.389967 0.407558 -0.230462 0.096774 -0.242282 -0.814433 ABOVE_12 1

5 rows × 231 columns

In [4]:
data.tail()
Out[4]:
PATIENT_VISIT_IDENTIFIER AGE_ABOVE65 AGE_PERCENTIL GENDER DISEASE GROUPING 1 DISEASE GROUPING 2 DISEASE GROUPING 3 DISEASE GROUPING 4 DISEASE GROUPING 5 DISEASE GROUPING 6 HTN IMMUNOCOMPROMISED OTHER ALBUMIN_MEDIAN ALBUMIN_MEAN ALBUMIN_MIN ALBUMIN_MAX ALBUMIN_DIFF BE_ARTERIAL_MEDIAN BE_ARTERIAL_MEAN BE_ARTERIAL_MIN BE_ARTERIAL_MAX BE_ARTERIAL_DIFF BE_VENOUS_MEDIAN BE_VENOUS_MEAN BE_VENOUS_MIN BE_VENOUS_MAX BE_VENOUS_DIFF BIC_ARTERIAL_MEDIAN BIC_ARTERIAL_MEAN BIC_ARTERIAL_MIN BIC_ARTERIAL_MAX BIC_ARTERIAL_DIFF BIC_VENOUS_MEDIAN BIC_VENOUS_MEAN BIC_VENOUS_MIN BIC_VENOUS_MAX BIC_VENOUS_DIFF BILLIRUBIN_MEDIAN BILLIRUBIN_MEAN BILLIRUBIN_MIN BILLIRUBIN_MAX BILLIRUBIN_DIFF BLAST_MEDIAN BLAST_MEAN BLAST_MIN BLAST_MAX BLAST_DIFF CALCIUM_MEDIAN CALCIUM_MEAN ... TTPA_MAX TTPA_DIFF UREA_MEDIAN UREA_MEAN UREA_MIN UREA_MAX UREA_DIFF DIMER_MEDIAN DIMER_MEAN DIMER_MIN DIMER_MAX DIMER_DIFF BLOODPRESSURE_DIASTOLIC_MEAN BLOODPRESSURE_SISTOLIC_MEAN HEART_RATE_MEAN RESPIRATORY_RATE_MEAN TEMPERATURE_MEAN OXYGEN_SATURATION_MEAN BLOODPRESSURE_DIASTOLIC_MEDIAN BLOODPRESSURE_SISTOLIC_MEDIAN HEART_RATE_MEDIAN RESPIRATORY_RATE_MEDIAN TEMPERATURE_MEDIAN OXYGEN_SATURATION_MEDIAN BLOODPRESSURE_DIASTOLIC_MIN BLOODPRESSURE_SISTOLIC_MIN HEART_RATE_MIN RESPIRATORY_RATE_MIN TEMPERATURE_MIN OXYGEN_SATURATION_MIN BLOODPRESSURE_DIASTOLIC_MAX BLOODPRESSURE_SISTOLIC_MAX HEART_RATE_MAX RESPIRATORY_RATE_MAX TEMPERATURE_MAX OXYGEN_SATURATION_MAX BLOODPRESSURE_DIASTOLIC_DIFF BLOODPRESSURE_SISTOLIC_DIFF HEART_RATE_DIFF RESPIRATORY_RATE_DIFF TEMPERATURE_DIFF OXYGEN_SATURATION_DIFF BLOODPRESSURE_DIASTOLIC_DIFF_REL BLOODPRESSURE_SISTOLIC_DIFF_REL HEART_RATE_DIFF_REL RESPIRATORY_RATE_DIFF_REL TEMPERATURE_DIFF_REL OXYGEN_SATURATION_DIFF_REL WINDOW ICU
1920 384 0 50th 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.012346 -0.292308 0.056604 -0.525424 0.535714 0.789474 0.012346 -0.292308 0.056604 -0.517241 0.535714 0.789474 0.175258 -0.050 0.145299 -0.428571 0.714286 0.919192 -0.299145 -0.502703 -0.164179 -0.575758 0.246377 0.789474 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 0-2 0
1921 384 0 50th 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.605263 0.605263 0.605263 0.605263 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -0.717277 -0.717277 -0.717277 -0.717277 -1.0 -0.317073 -0.317073 -0.317073 -0.317073 -1.0 -0.170732 -0.170732 -0.170732 -0.170732 -1.0 -0.982208 -0.982208 -0.982208 -0.982208 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 0.244898 0.244898 ... -0.869210 -1.0 -0.879518 -0.879518 -0.879518 -0.879518 -1.0 -0.979571 -0.979571 -0.979571 -0.979571 -1.0 0.086420 -0.384615 -0.113208 -0.593220 0.142857 0.578947 0.086420 -0.384615 -0.113208 -0.586207 0.142857 0.578947 0.237113 -0.125 -0.008547 -0.500000 0.472527 0.838384 -0.247863 -0.567568 -0.298507 -0.636364 -0.072464 0.578947 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 2-4 0
1922 384 0 50th 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.086420 -0.230769 -0.169811 -0.593220 0.142857 0.736842 0.086420 -0.230769 -0.169811 -0.586207 0.142857 0.736842 0.237113 0.000 -0.059829 -0.500000 0.472527 0.898990 -0.247863 -0.459459 -0.343284 -0.636364 -0.072464 0.736842 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 4-6 0
1923 384 0 50th 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.209877 -0.384615 -0.188679 -0.661017 0.285714 0.473684 0.209877 -0.384615 -0.188679 -0.655172 0.285714 0.473684 0.340206 -0.125 -0.076923 -0.571429 0.560440 0.797980 -0.162393 -0.567568 -0.358209 -0.696970 0.043478 0.473684 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 6-12 0
1924 384 0 50th 1 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.605263 0.605263 0.605263 0.605263 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.000000 -1.000000 -1.000000 -1.000000 -1.0 -0.317073 -0.317073 -0.317073 -0.317073 -1.0 -0.317073 -0.317073 -0.317073 -0.317073 -1.0 -0.983255 -0.983255 -0.983255 -0.983255 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 0.306122 0.306122 ... -0.846633 -1.0 -0.807229 -0.807229 -0.807229 -0.807229 -1.0 -0.888448 -0.888448 -0.888448 -0.888448 -1.0 -0.185185 -0.539103 -0.107704 -0.610169 0.050595 0.662281 -0.160494 -0.538462 -0.075472 -0.586207 0.071429 0.631579 -0.175258 -0.375 -0.247863 -0.785714 0.186813 0.777778 -0.247863 -0.470270 -0.149254 -0.515152 0.101449 0.842105 -0.652174 -0.644172 -0.633588 -0.647059 -0.547619 -0.838384 -0.701863 -0.585967 -0.763868 -0.612903 -0.551337 -0.835052 ABOVE_12 0

5 rows × 231 columns

Observation:

  • PATIENT_VISIT_IDENTIFIER is a unique ID
  • 229 columns contain integer or float values
  • 2 columns contain object values
  • NaN values represent missing data
In [5]:
print("Dataset contains (rows, cols):",data.shape)
Dataset contains (rows, cols): (1925, 231)
In [6]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1925 entries, 0 to 1924
Columns: 231 entries, PATIENT_VISIT_IDENTIFIER to ICU
dtypes: float64(225), int64(4), object(2)
memory usage: 3.4+ MB

Observation:

  • the info() method shows that the data types are integer, float, and object
In [7]:
profile_data = pdp.ProfileReport(data, 
                                      minimal = True, 
                                      explorative=True, 
                                      title = 'ProfilingResults',
                                      progress_bar=True)
In [8]:
profile_data
Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]
Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]
Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]
Out[8]:

This data is set up in a very specific way. For each patient, as identified by the PATIENT_VISIT_IDENTIFIER feature, the entries represent different stages of the patient's stay since admission to the hospital. Let's take a look at how many entries there are for each patient.

METADATA¶

In [9]:
data.columns
Out[9]:
Index(['PATIENT_VISIT_IDENTIFIER', 'AGE_ABOVE65', 'AGE_PERCENTIL', 'GENDER',
       'DISEASE GROUPING 1', 'DISEASE GROUPING 2', 'DISEASE GROUPING 3',
       'DISEASE GROUPING 4', 'DISEASE GROUPING 5', 'DISEASE GROUPING 6',
       ...
       'TEMPERATURE_DIFF', 'OXYGEN_SATURATION_DIFF',
       'BLOODPRESSURE_DIASTOLIC_DIFF_REL', 'BLOODPRESSURE_SISTOLIC_DIFF_REL',
       'HEART_RATE_DIFF_REL', 'RESPIRATORY_RATE_DIFF_REL',
       'TEMPERATURE_DIFF_REL', 'OXYGEN_SATURATION_DIFF_REL', 'WINDOW', 'ICU'],
      dtype='object', length=231)
In [10]:
data = data.drop_duplicates()
data.shape
Out[10]:
(1925, 231)

Observations:

  • Dataset has 1,925 rows and 231 variables
  • no duplicated rows in the dataset
In [11]:
#Confirming number of entries per patient
data.groupby(by = 'PATIENT_VISIT_IDENTIFIER').count()['ICU'].sort_values(ascending = False)
Out[11]:
PATIENT_VISIT_IDENTIFIER
0      5
193    5
263    5
262    5
261    5
      ..
126    5
125    5
124    5
123    5
384    5
Name: ICU, Length: 385, dtype: int64

As we can see, all patients in this dataset have the same number of entries. However, as instructed, data from patients who have already been transferred to the ICU are not to be used. But we are not yet discarding this data. First, we'll extract two critical pieces of information from the raw data:

  • Which patients are admitted to the ICU;
  • When these patients are admitted to the ICU.
In [12]:
# function to confirm the patient admitted to the ICU
def ICU_admission(data):
    admission_data = data.groupby(
        by = 'PATIENT_VISIT_IDENTIFIER', 
        as_index = False).max()[['PATIENT_VISIT_IDENTIFIER', 'ICU']]
    
    admission_time_data = data.groupby(by = ['PATIENT_VISIT_IDENTIFIER', 'ICU'],
                                       as_index = False).first()[['PATIENT_VISIT_IDENTIFIER', 'ICU', 'WINDOW']]
    
    admission_data = admission_data.join(
        other = admission_time_data[admission_time_data['ICU'] == 1].set_index('PATIENT_VISIT_IDENTIFIER'),
        on = 'PATIENT_VISIT_IDENTIFIER',
        how = 'left',
        rsuffix = '_R')
    
    return admission_data.drop(columns = 'ICU_R')
In [13]:
admission_data = ICU_admission(data)
print(admission_data)
     PATIENT_VISIT_IDENTIFIER  ICU    WINDOW
0                           0    1  ABOVE_12
1                           1    1       0-2
2                           2    1  ABOVE_12
3                           3    0       NaN
4                           4    0       NaN
..                        ...  ...       ...
380                       380    1  ABOVE_12
381                       381    0       NaN
382                       382    1  ABOVE_12
383                       383    0       NaN
384                       384    0       NaN

[385 rows x 3 columns]

Observation:

  • This function shows whether and when each patient was admitted to the ICU
In [14]:
dt=data.groupby('PATIENT_VISIT_IDENTIFIER',as_index=False).sum()['ICU'].reset_index()
admitted_to_ICU=[]
not_admitted_to_ICU= []

for i in dt['ICU']:
    if i ==0:
        not_admitted_to_ICU.append(i)
    elif i > 0:
        admitted_to_ICU.append(i)
In [15]:
len(admitted_to_ICU)
Out[15]:
195
In [16]:
len(not_admitted_to_ICU)
Out[16]:
190

Observation:

  • Of the 385 patients in total, 195 were admitted to the ICU and 190 were not.

We will now reorganise the patient data into a structure more akin to a time series, making each unique ID appear in only one row.

In [17]:
#Define function to rearrange the data
admission_window_order = {
    '0-2': 1,
    '2-4': 2,
    '4-6': 3,
    '6-12': 4,
    'ABOVE_12': 5}

def to_timeseries_format(data, position_dict):
    
    #Order dictionary
    position_dict = sorted(position_dict.items())
    
    #Split data
    df_list = []
    for position in position_dict:
        value, pos = position
        suffix = '_' + str(pos)
        df_list.append(data[data['WINDOW'] == value].add_suffix(suffix).reset_index(drop = True))
        
    #Reassemble data
    output_data = pd.concat(df_list, axis = 1)
    return output_data
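A toy run makes the reshaping easier to see. The function below is a condensed restatement of the same split-suffix-concat logic (not the notebook's own code), using two windows instead of five; it assumes each patient has exactly one row per window in the same order, which holds for this dataset.

```python
import pandas as pd

# Condensed restatement of the reshaping idea: split rows by WINDOW,
# suffix each slice's columns with its window position, then concatenate
# the slices side by side so each patient occupies a single row.
def to_wide(data, window_order):
    parts = []
    for value, pos in sorted(window_order.items()):
        part = data[data['WINDOW'] == value].add_suffix('_' + str(pos))
        parts.append(part.reset_index(drop=True))
    return pd.concat(parts, axis=1)

toy = pd.DataFrame({
    'PATIENT_VISIT_IDENTIFIER': [0, 0, 1, 1],
    'WINDOW': ['0-2', '2-4', '0-2', '2-4'],
    'HEART_RATE_MEAN': [0.1, 0.2, 0.3, 0.4],
})
wide = to_wide(toy, {'0-2': 1, '2-4': 2})
print(wide.shape)                          # (2, 6): one row per patient
print(wide['HEART_RATE_MEAN_2'].tolist())  # [0.2, 0.4]
```

On the real data this turns 1,925 rows × 231 columns into 385 rows × 1,155 columns before the duplicated identifier columns are dropped.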
In [18]:
#Converting the data into time series format
data = to_timeseries_format(data, admission_window_order)
data = data.drop(columns = ['PATIENT_VISIT_IDENTIFIER_2', 'PATIENT_VISIT_IDENTIFIER_3',
                            'PATIENT_VISIT_IDENTIFIER_4', 'PATIENT_VISIT_IDENTIFIER_5'])

Observation:

  • Creating a time-series layout across the 5 WINDOW slots, and dropping the duplicated patient visit identifier columns (keeping only the first), because it's a unique ID
In [19]:
data
Out[19]:
PATIENT_VISIT_IDENTIFIER_1 AGE_ABOVE65_1 AGE_PERCENTIL_1 GENDER_1 DISEASE GROUPING 1_1 DISEASE GROUPING 2_1 DISEASE GROUPING 3_1 DISEASE GROUPING 4_1 DISEASE GROUPING 5_1 DISEASE GROUPING 6_1 HTN_1 IMMUNOCOMPROMISED_1 OTHER_1 ALBUMIN_MEDIAN_1 ALBUMIN_MEAN_1 ALBUMIN_MIN_1 ALBUMIN_MAX_1 ALBUMIN_DIFF_1 BE_ARTERIAL_MEDIAN_1 BE_ARTERIAL_MEAN_1 BE_ARTERIAL_MIN_1 BE_ARTERIAL_MAX_1 BE_ARTERIAL_DIFF_1 BE_VENOUS_MEDIAN_1 BE_VENOUS_MEAN_1 BE_VENOUS_MIN_1 BE_VENOUS_MAX_1 BE_VENOUS_DIFF_1 BIC_ARTERIAL_MEDIAN_1 BIC_ARTERIAL_MEAN_1 BIC_ARTERIAL_MIN_1 BIC_ARTERIAL_MAX_1 BIC_ARTERIAL_DIFF_1 BIC_VENOUS_MEDIAN_1 BIC_VENOUS_MEAN_1 BIC_VENOUS_MIN_1 BIC_VENOUS_MAX_1 BIC_VENOUS_DIFF_1 BILLIRUBIN_MEDIAN_1 BILLIRUBIN_MEAN_1 BILLIRUBIN_MIN_1 BILLIRUBIN_MAX_1 BILLIRUBIN_DIFF_1 BLAST_MEDIAN_1 BLAST_MEAN_1 BLAST_MIN_1 BLAST_MAX_1 BLAST_DIFF_1 CALCIUM_MEDIAN_1 CALCIUM_MEAN_1 ... TTPA_MAX_5 TTPA_DIFF_5 UREA_MEDIAN_5 UREA_MEAN_5 UREA_MIN_5 UREA_MAX_5 UREA_DIFF_5 DIMER_MEDIAN_5 DIMER_MEAN_5 DIMER_MIN_5 DIMER_MAX_5 DIMER_DIFF_5 BLOODPRESSURE_DIASTOLIC_MEAN_5 BLOODPRESSURE_SISTOLIC_MEAN_5 HEART_RATE_MEAN_5 RESPIRATORY_RATE_MEAN_5 TEMPERATURE_MEAN_5 OXYGEN_SATURATION_MEAN_5 BLOODPRESSURE_DIASTOLIC_MEDIAN_5 BLOODPRESSURE_SISTOLIC_MEDIAN_5 HEART_RATE_MEDIAN_5 RESPIRATORY_RATE_MEDIAN_5 TEMPERATURE_MEDIAN_5 OXYGEN_SATURATION_MEDIAN_5 BLOODPRESSURE_DIASTOLIC_MIN_5 BLOODPRESSURE_SISTOLIC_MIN_5 HEART_RATE_MIN_5 RESPIRATORY_RATE_MIN_5 TEMPERATURE_MIN_5 OXYGEN_SATURATION_MIN_5 BLOODPRESSURE_DIASTOLIC_MAX_5 BLOODPRESSURE_SISTOLIC_MAX_5 HEART_RATE_MAX_5 RESPIRATORY_RATE_MAX_5 TEMPERATURE_MAX_5 OXYGEN_SATURATION_MAX_5 BLOODPRESSURE_DIASTOLIC_DIFF_5 BLOODPRESSURE_SISTOLIC_DIFF_5 HEART_RATE_DIFF_5 RESPIRATORY_RATE_DIFF_5 TEMPERATURE_DIFF_5 OXYGEN_SATURATION_DIFF_5 BLOODPRESSURE_DIASTOLIC_DIFF_REL_5 BLOODPRESSURE_SISTOLIC_DIFF_REL_5 HEART_RATE_DIFF_REL_5 RESPIRATORY_RATE_DIFF_REL_5 TEMPERATURE_DIFF_REL_5 OXYGEN_SATURATION_DIFF_REL_5 WINDOW_5 ICU_5
0 0 1 60th 0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... -0.846633 -1.0 -0.836145 -0.836145 -0.836145 -0.836145 -1.0 -0.996762 -0.996762 -0.996762 -0.996762 -1.0 -0.243021 -0.338537 -0.213031 -0.317859 0.033779 0.665932 -0.283951 -0.376923 -0.188679 -0.379310 0.035714 0.631579 -0.340206 -0.4875 -0.572650 -0.857143 0.098901 0.797980 -0.076923 0.286486 0.298507 0.272727 0.362319 0.947368 -0.339130 0.325153 0.114504 0.176471 -0.238095 -0.818182 -0.389967 0.407558 -0.230462 0.096774 -0.242282 -0.814433 ABOVE_12 1
1 1 1 90th 1 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... -0.825613 -1.0 -0.460241 -0.460241 -0.460241 -0.460241 -1.0 -0.978029 -0.978029 -0.978029 -0.978029 -1.0 -0.178122 0.212601 -0.141163 -0.380216 0.010915 0.841977 -0.185185 0.184615 -0.169811 -0.379310 0.000000 0.842105 -0.587629 -0.3250 -0.572650 -1.000000 0.010989 0.797980 0.555556 0.556757 0.298507 0.757576 0.710145 1.000000 0.513043 0.472393 0.114504 0.764706 0.142857 -0.797980 0.315690 0.200359 -0.239515 0.645161 0.139709 -0.802317 ABOVE_12 1
2 2 0 10th 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.605263 0.605263 0.605263 0.605263 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -0.317073 -0.317073 -0.317073 -0.317073 -1.0 -0.317073 -0.317073 -0.317073 -0.317073 -1.0 -0.938950 -0.938950 -0.938950 -0.938950 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 0.357143 0.357143 ... -0.846633 -1.0 -0.927711 -0.927711 -0.927711 -0.927711 -1.0 -0.978029 -0.978029 -0.978029 -0.978029 -1.0 -0.181070 -0.551603 -0.280660 -0.543785 0.057292 0.797149 -0.160494 -0.538462 -0.273585 -0.517241 0.107143 0.789474 -0.298969 -0.4500 -0.487179 -0.642857 0.142857 0.878788 -0.247863 -0.351351 -0.149254 -0.454545 0.101449 0.947368 -0.547826 -0.435583 -0.419847 -0.705882 -0.500000 -0.898990 -0.612422 -0.343258 -0.576744 -0.695341 -0.505464 -0.900129 ABOVE_12 1
3 3 0 40th 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... -0.846633 -1.0 -0.937349 -0.937349 -0.937349 -0.937349 -1.0 -0.978029 -0.978029 -0.978029 -0.978029 -1.0 -0.002798 -0.546256 -0.270189 -0.535593 0.033571 0.694035 0.086420 -0.538462 -0.301887 -0.517241 -0.035714 0.736842 -0.381443 -0.6250 -0.521368 -0.857143 0.120879 0.171717 0.145299 -0.286486 0.477612 -0.272727 0.623188 1.000000 -0.078261 -0.190184 0.251908 -0.352941 -0.047619 -0.171717 -0.308696 -0.057718 -0.069094 -0.329749 -0.047619 -0.172436 ABOVE_12 0
4 4 0 10th 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.605263 0.605263 0.605263 0.605263 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -0.317073 -0.317073 -0.317073 -0.317073 -1.0 -0.317073 -0.317073 -0.317073 -0.317073 -1.0 -0.935113 -0.935113 -0.935113 -0.935113 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 0.357143 0.357143 ... -0.846633 -1.0 -0.922892 -0.922892 -0.922892 -0.922892 -1.0 -0.978029 -0.978029 -0.978029 -0.978029 -1.0 0.290762 -0.074271 0.051399 -0.499708 0.040640 0.820327 0.333333 -0.076923 0.056604 -0.517241 0.071429 0.789474 0.030928 -0.1250 -0.230769 -0.500000 0.208791 0.898990 0.094017 -0.178378 0.104478 -0.454545 0.014493 0.894737 -0.478261 -0.558282 -0.389313 -0.823529 -0.642857 -0.939394 -0.652174 -0.596165 -0.634847 -0.817204 -0.645793 -0.940077 ABOVE_12 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
380 380 0 40th 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... -0.656676 -1.0 -0.879518 -0.879518 -0.879518 -0.879518 -1.0 -0.978029 -0.978029 -0.978029 -0.978029 -1.0 -0.550154 -0.887500 0.395047 -0.582627 0.257812 0.659539 -0.481481 -0.830769 0.471698 -0.655172 0.285714 0.736842 -0.711340 -0.8875 -0.196581 -0.857143 0.318681 0.676768 -0.572650 -0.675676 0.432836 -0.272727 0.304348 0.894737 -0.530435 -0.374233 -0.083969 -0.352941 -0.523810 -0.717172 -0.505721 -0.119847 -0.553531 -0.245968 -0.535361 -0.717417 ABOVE_12 1
381 381 1 Above 90th 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN -0.399177 -0.584615 -0.396226 -0.570621 0.238095 0.754386 -0.407407 -0.538462 -0.490566 -0.517241 0.250000 0.789474 -0.175258 -0.3625 -0.350427 -0.571429 0.384615 0.878788 -0.572650 -0.675676 -0.373134 -0.575758 0.188406 0.789474 -0.982609 -0.889571 -0.770992 -0.882353 -0.690476 -0.959596 -0.982609 -0.871507 -0.804670 -0.878136 -0.697169 -0.960052 ABOVE_12 0
382 382 0 50th 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.605263 0.605263 0.605263 0.605263 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -0.317073 -0.317073 -0.317073 -0.317073 -1.0 -0.317073 -0.317073 -0.317073 -0.317073 -1.0 -0.938950 -0.938950 -0.938950 -0.938950 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 0.357143 0.357143 ... -0.765668 -1.0 -0.855422 -0.855422 -0.855422 -0.855422 -1.0 -0.970474 -0.970474 -0.970474 -0.970474 -1.0 -0.089712 -0.522637 -0.226056 -0.409201 -0.029592 0.631579 -0.086420 -0.553846 -0.245283 -0.448276 -0.071429 0.631579 -0.340206 -0.4125 -0.333333 -0.714286 0.098901 0.777778 -0.145299 -0.243243 -0.059701 0.030303 0.043478 0.842105 -0.408696 -0.349693 -0.465649 -0.176471 -0.500000 -0.838384 -0.513996 -0.236377 -0.617378 -0.191851 -0.498615 -0.835052 ABOVE_12 1
383 383 0 40th 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... -0.846633 -1.0 -0.932530 -0.932530 -0.932530 -0.932530 -1.0 -0.978029 -0.978029 -0.978029 -0.978029 -1.0 -0.083298 -0.478691 -0.190414 -0.541009 0.036125 0.705989 -0.160494 -0.538462 -0.188679 -0.517241 0.000000 0.736842 -0.175258 -0.3750 -0.418803 -0.714286 0.164835 0.797980 -0.076923 -0.470270 -0.029851 -0.393939 0.043478 0.894737 -0.478261 -0.644172 -0.358779 -0.588235 -0.571429 -0.838384 -0.552795 -0.585967 -0.557252 -0.573477 -0.572609 -0.838524 ABOVE_12 0
384 384 0 50th 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... -0.846633 -1.0 -0.807229 -0.807229 -0.807229 -0.807229 -1.0 -0.888448 -0.888448 -0.888448 -0.888448 -1.0 -0.185185 -0.539103 -0.107704 -0.610169 0.050595 0.662281 -0.160494 -0.538462 -0.075472 -0.586207 0.071429 0.631579 -0.175258 -0.3750 -0.247863 -0.785714 0.186813 0.777778 -0.247863 -0.470270 -0.149254 -0.515152 0.101449 0.842105 -0.652174 -0.644172 -0.633588 -0.647059 -0.547619 -0.838384 -0.701863 -0.585967 -0.763868 -0.612903 -0.551337 -0.835052 ABOVE_12 0

385 rows × 1151 columns

We have now completed most of the data preparation. Next we will eliminate redundant variables to shrink the dataset: features such as AGE_PERCENTIL, AGE_ABOVE65, GENDER, and HTN do not change for a given patient, so the copies created for windows 2 through 5 carry no new information and can be dropped.

In [20]:
#Define function to remove redundant per-window copies of static columns
def remove_redundant_cols(data, cols, range_begin, range_end):
    for n in range(range_begin, range_end + 1):
        rm_cols = [x + '_' + str(n) for x in cols]
        data = data.drop(columns=rm_cols)

    return data
In [21]:
#Remove redundant columns, keeping only the window-1 copy; WINDOW_1 itself is also dropped
redundant_cols = ['AGE_ABOVE65', 'AGE_PERCENTIL', 'GENDER', 'HTN', 'WINDOW']
data = remove_redundant_cols(data, redundant_cols, 2, 5).drop(columns = 'WINDOW_1')
In [22]:
data.head()
Out[22]:
PATIENT_VISIT_IDENTIFIER_1 AGE_ABOVE65_1 AGE_PERCENTIL_1 GENDER_1 DISEASE GROUPING 1_1 DISEASE GROUPING 2_1 DISEASE GROUPING 3_1 DISEASE GROUPING 4_1 DISEASE GROUPING 5_1 DISEASE GROUPING 6_1 HTN_1 IMMUNOCOMPROMISED_1 OTHER_1 ALBUMIN_MEDIAN_1 ALBUMIN_MEAN_1 ALBUMIN_MIN_1 ALBUMIN_MAX_1 ALBUMIN_DIFF_1 BE_ARTERIAL_MEDIAN_1 BE_ARTERIAL_MEAN_1 BE_ARTERIAL_MIN_1 BE_ARTERIAL_MAX_1 BE_ARTERIAL_DIFF_1 BE_VENOUS_MEDIAN_1 BE_VENOUS_MEAN_1 BE_VENOUS_MIN_1 BE_VENOUS_MAX_1 BE_VENOUS_DIFF_1 BIC_ARTERIAL_MEDIAN_1 BIC_ARTERIAL_MEAN_1 BIC_ARTERIAL_MIN_1 BIC_ARTERIAL_MAX_1 BIC_ARTERIAL_DIFF_1 BIC_VENOUS_MEDIAN_1 BIC_VENOUS_MEAN_1 BIC_VENOUS_MIN_1 BIC_VENOUS_MAX_1 BIC_VENOUS_DIFF_1 BILLIRUBIN_MEDIAN_1 BILLIRUBIN_MEAN_1 BILLIRUBIN_MIN_1 BILLIRUBIN_MAX_1 BILLIRUBIN_DIFF_1 BLAST_MEDIAN_1 BLAST_MEAN_1 BLAST_MIN_1 BLAST_MAX_1 BLAST_DIFF_1 CALCIUM_MEDIAN_1 CALCIUM_MEAN_1 ... TTPA_MIN_5 TTPA_MAX_5 TTPA_DIFF_5 UREA_MEDIAN_5 UREA_MEAN_5 UREA_MIN_5 UREA_MAX_5 UREA_DIFF_5 DIMER_MEDIAN_5 DIMER_MEAN_5 DIMER_MIN_5 DIMER_MAX_5 DIMER_DIFF_5 BLOODPRESSURE_DIASTOLIC_MEAN_5 BLOODPRESSURE_SISTOLIC_MEAN_5 HEART_RATE_MEAN_5 RESPIRATORY_RATE_MEAN_5 TEMPERATURE_MEAN_5 OXYGEN_SATURATION_MEAN_5 BLOODPRESSURE_DIASTOLIC_MEDIAN_5 BLOODPRESSURE_SISTOLIC_MEDIAN_5 HEART_RATE_MEDIAN_5 RESPIRATORY_RATE_MEDIAN_5 TEMPERATURE_MEDIAN_5 OXYGEN_SATURATION_MEDIAN_5 BLOODPRESSURE_DIASTOLIC_MIN_5 BLOODPRESSURE_SISTOLIC_MIN_5 HEART_RATE_MIN_5 RESPIRATORY_RATE_MIN_5 TEMPERATURE_MIN_5 OXYGEN_SATURATION_MIN_5 BLOODPRESSURE_DIASTOLIC_MAX_5 BLOODPRESSURE_SISTOLIC_MAX_5 HEART_RATE_MAX_5 RESPIRATORY_RATE_MAX_5 TEMPERATURE_MAX_5 OXYGEN_SATURATION_MAX_5 BLOODPRESSURE_DIASTOLIC_DIFF_5 BLOODPRESSURE_SISTOLIC_DIFF_5 HEART_RATE_DIFF_5 RESPIRATORY_RATE_DIFF_5 TEMPERATURE_DIFF_5 OXYGEN_SATURATION_DIFF_5 BLOODPRESSURE_DIASTOLIC_DIFF_REL_5 BLOODPRESSURE_SISTOLIC_DIFF_REL_5 HEART_RATE_DIFF_REL_5 RESPIRATORY_RATE_DIFF_REL_5 TEMPERATURE_DIFF_REL_5 OXYGEN_SATURATION_DIFF_REL_5 ICU_5
0 0 1 60th 0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... -0.846633 -0.846633 -1.0 -0.836145 -0.836145 -0.836145 -0.836145 -1.0 -0.996762 -0.996762 -0.996762 -0.996762 -1.0 -0.243021 -0.338537 -0.213031 -0.317859 0.033779 0.665932 -0.283951 -0.376923 -0.188679 -0.379310 0.035714 0.631579 -0.340206 -0.4875 -0.572650 -0.857143 0.098901 0.797980 -0.076923 0.286486 0.298507 0.272727 0.362319 0.947368 -0.339130 0.325153 0.114504 0.176471 -0.238095 -0.818182 -0.389967 0.407558 -0.230462 0.096774 -0.242282 -0.814433 1
1 1 1 90th 1 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... -0.825613 -0.825613 -1.0 -0.460241 -0.460241 -0.460241 -0.460241 -1.0 -0.978029 -0.978029 -0.978029 -0.978029 -1.0 -0.178122 0.212601 -0.141163 -0.380216 0.010915 0.841977 -0.185185 0.184615 -0.169811 -0.379310 0.000000 0.842105 -0.587629 -0.3250 -0.572650 -1.000000 0.010989 0.797980 0.555556 0.556757 0.298507 0.757576 0.710145 1.000000 0.513043 0.472393 0.114504 0.764706 0.142857 -0.797980 0.315690 0.200359 -0.239515 0.645161 0.139709 -0.802317 1
2 2 0 10th 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.605263 0.605263 0.605263 0.605263 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -0.317073 -0.317073 -0.317073 -0.317073 -1.0 -0.317073 -0.317073 -0.317073 -0.317073 -1.0 -0.938950 -0.938950 -0.938950 -0.938950 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 0.357143 0.357143 ... -0.846633 -0.846633 -1.0 -0.927711 -0.927711 -0.927711 -0.927711 -1.0 -0.978029 -0.978029 -0.978029 -0.978029 -1.0 -0.181070 -0.551603 -0.280660 -0.543785 0.057292 0.797149 -0.160494 -0.538462 -0.273585 -0.517241 0.107143 0.789474 -0.298969 -0.4500 -0.487179 -0.642857 0.142857 0.878788 -0.247863 -0.351351 -0.149254 -0.454545 0.101449 0.947368 -0.547826 -0.435583 -0.419847 -0.705882 -0.500000 -0.898990 -0.612422 -0.343258 -0.576744 -0.695341 -0.505464 -0.900129 1
3 3 0 40th 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... -0.846633 -0.846633 -1.0 -0.937349 -0.937349 -0.937349 -0.937349 -1.0 -0.978029 -0.978029 -0.978029 -0.978029 -1.0 -0.002798 -0.546256 -0.270189 -0.535593 0.033571 0.694035 0.086420 -0.538462 -0.301887 -0.517241 -0.035714 0.736842 -0.381443 -0.6250 -0.521368 -0.857143 0.120879 0.171717 0.145299 -0.286486 0.477612 -0.272727 0.623188 1.000000 -0.078261 -0.190184 0.251908 -0.352941 -0.047619 -0.171717 -0.308696 -0.057718 -0.069094 -0.329749 -0.047619 -0.172436 0
4 4 0 10th 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.605263 0.605263 0.605263 0.605263 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -0.317073 -0.317073 -0.317073 -0.317073 -1.0 -0.317073 -0.317073 -0.317073 -0.317073 -1.0 -0.935113 -0.935113 -0.935113 -0.935113 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 0.357143 0.357143 ... -0.846633 -0.846633 -1.0 -0.922892 -0.922892 -0.922892 -0.922892 -1.0 -0.978029 -0.978029 -0.978029 -0.978029 -1.0 0.290762 -0.074271 0.051399 -0.499708 0.040640 0.820327 0.333333 -0.076923 0.056604 -0.517241 0.071429 0.789474 0.030928 -0.1250 -0.230769 -0.500000 0.208791 0.898990 0.094017 -0.178378 0.104478 -0.454545 0.014493 0.894737 -0.478261 -0.558282 -0.389313 -0.823529 -0.642857 -0.939394 -0.652174 -0.596165 -0.634847 -0.817204 -0.645793 -0.940077 0

5 rows × 1130 columns

For a time-series dataset like this one, it is generally preferable to keep one row per unique ID and spread the repeated measurements across columns; this wide layout makes it easier to identify and organise all the data belonging to each patient.
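The long-to-wide reshaping described above can be sketched with pandas `pivot`. This is a minimal illustration on toy data (the column names and values here are assumptions for demonstration, not the notebook's actual reshaping code):

```python
import pandas as pd

# Toy long-format frame: one row per (patient, window)
long_df = pd.DataFrame({
    'PATIENT_VISIT_IDENTIFIER': [0, 0, 1, 1],
    'WINDOW': [1, 2, 1, 2],
    'HEART_RATE_MEAN': [0.1, 0.2, 0.3, 0.4],
})

# Pivot so each patient becomes a single row, one column per window
wide = long_df.pivot(index='PATIENT_VISIT_IDENTIFIER',
                     columns='WINDOW',
                     values='HEART_RATE_MEAN')

# Flatten the column labels into the NAME_window convention used above
wide.columns = [f'HEART_RATE_MEAN_{w}' for w in wide.columns]
wide = wide.reset_index()
```

Applied to every measurement column, this is what turns the original 1925 × 231 long table into the 385 × 1151 wide table shown above.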

Data Cleaning and Pre-Processing ¶

In this section, we will do two things:

  1. Handle missing values;
  2. Reduce the data where ICU = 1.
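The two steps above can be sketched on toy data as follows. The column names and values are assumptions for illustration only: step 1 fills gaps within each patient's records, and step 2 drops the rows recorded once the patient is already in the ICU, so the model only ever sees pre-admission measurements.

```python
import pandas as pd
import numpy as np

# Toy long-format data: one row per (patient, window)
df = pd.DataFrame({
    'PATIENT_VISIT_IDENTIFIER': [0, 0, 0, 1, 1, 1],
    'WINDOW': ['0-2', '2-4', '4-6'] * 2,
    'LAB': [np.nan, 1.0, np.nan, 2.0, np.nan, 3.0],
    'ICU': [0, 1, 1, 0, 0, 1],
})

# Step 1: fill missing lab values within each patient
# (forward fill, then backward fill for leading gaps)
df['LAB'] = df.groupby('PATIENT_VISIT_IDENTIFIER')['LAB'].transform(
    lambda s: s.ffill().bfill())

# Step 2: flag every row from the first ICU window onwards,
# then keep only the rows strictly before ICU admission
in_icu = df.groupby('PATIENT_VISIT_IDENTIFIER')['ICU'].cummax()
df_pre = df[in_icu == 0]
```

In the actual task, the target for the surviving pre-ICU rows would then be set to 1 for patients who were eventually admitted; this sketch only shows the filtering.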

Missing Values and Handling Missing Data¶

In [23]:
#Total number of missing values across the entire dataset
data.isnull().sum().sum()
Out[23]:
223859
In [24]:
#Report every column with missing values and its missing-value percentage
print('NaN values =', data.isnull().sum().sum())
print()

vars_with_missing = []

for feature in data.columns:
    missings = data[feature].isna().sum()

    if missings > 0:
        vars_with_missing.append(feature)
        missings_perc = missings / data.shape[0]

        print('Variable {} has {} records ({:.2%}) with missing values.'.format(feature, missings, missings_perc))
print('In total, there are {} variables with missing values'.format(len(vars_with_missing)))
NaN values = 223859

Variable DISEASE GROUPING 1_1 has 1 records (0.26%) with missing values.
Variable DISEASE GROUPING 2_1 has 1 records (0.26%) with missing values.
Variable DISEASE GROUPING 3_1 has 1 records (0.26%) with missing values.
Variable DISEASE GROUPING 4_1 has 1 records (0.26%) with missing values.
Variable DISEASE GROUPING 5_1 has 1 records (0.26%) with missing values.
Variable DISEASE GROUPING 6_1 has 1 records (0.26%) with missing values.
Variable HTN_1 has 1 records (0.26%) with missing values.
Variable IMMUNOCOMPROMISED_1 has 1 records (0.26%) with missing values.
Variable OTHER_1 has 1 records (0.26%) with missing values.
Variable ALBUMIN_MEDIAN_1 has 213 records (55.32%) with missing values.
Variable ALBUMIN_MEAN_1 has 213 records (55.32%) with missing values.
Variable ALBUMIN_MIN_1 has 213 records (55.32%) with missing values.
Variable ALBUMIN_MAX_1 has 213 records (55.32%) with missing values.
Variable ALBUMIN_DIFF_1 has 213 records (55.32%) with missing values.
Variable BE_ARTERIAL_MEDIAN_1 has 213 records (55.32%) with missing values.
Variable BE_ARTERIAL_MEAN_1 has 213 records (55.32%) with missing values.
Variable BE_ARTERIAL_MIN_1 has 213 records (55.32%) with missing values.
Variable BE_ARTERIAL_MAX_1 has 213 records (55.32%) with missing values.
Variable BE_ARTERIAL_DIFF_1 has 213 records (55.32%) with missing values.
Variable BE_VENOUS_MEDIAN_1 has 213 records (55.32%) with missing values.
Variable BE_VENOUS_MEAN_1 has 213 records (55.32%) with missing values.
Variable BE_VENOUS_MIN_1 has 213 records (55.32%) with missing values.
Variable BE_VENOUS_MAX_1 has 213 records (55.32%) with missing values.
Variable BE_VENOUS_DIFF_1 has 213 records (55.32%) with missing values.
Variable BIC_ARTERIAL_MEDIAN_1 has 213 records (55.32%) with missing values.
Variable BIC_ARTERIAL_MEAN_1 has 213 records (55.32%) with missing values.
Variable BIC_ARTERIAL_MIN_1 has 213 records (55.32%) with missing values.
Variable BIC_ARTERIAL_MAX_1 has 213 records (55.32%) with missing values.
Variable BIC_ARTERIAL_DIFF_1 has 213 records (55.32%) with missing values.
Variable BIC_VENOUS_MEDIAN_1 has 213 records (55.32%) with missing values.
Variable BIC_VENOUS_MEAN_1 has 213 records (55.32%) with missing values.
Variable BIC_VENOUS_MIN_1 has 213 records (55.32%) with missing values.
Variable BIC_VENOUS_MAX_1 has 213 records (55.32%) with missing values.
Variable BIC_VENOUS_DIFF_1 has 213 records (55.32%) with missing values.
Variable BILLIRUBIN_MEDIAN_1 has 213 records (55.32%) with missing values.
Variable BILLIRUBIN_MEAN_1 has 213 records (55.32%) with missing values.
Variable BILLIRUBIN_MIN_1 has 213 records (55.32%) with missing values.
Variable BILLIRUBIN_MAX_1 has 213 records (55.32%) with missing values.
Variable BILLIRUBIN_DIFF_1 has 213 records (55.32%) with missing values.
Variable BLAST_MEDIAN_1 has 213 records (55.32%) with missing values.
Variable BLAST_MEAN_1 has 213 records (55.32%) with missing values.
Variable BLAST_MIN_1 has 213 records (55.32%) with missing values.
Variable BLAST_MAX_1 has 213 records (55.32%) with missing values.
Variable BLAST_DIFF_1 has 213 records (55.32%) with missing values.
Variable CALCIUM_MEDIAN_1 has 213 records (55.32%) with missing values.
Variable CALCIUM_MEAN_1 has 213 records (55.32%) with missing values.
Variable CALCIUM_MIN_1 has 213 records (55.32%) with missing values.
Variable CALCIUM_MAX_1 has 213 records (55.32%) with missing values.
Variable CALCIUM_DIFF_1 has 213 records (55.32%) with missing values.
Variable CREATININ_MEDIAN_1 has 213 records (55.32%) with missing values.
Variable CREATININ_MEAN_1 has 213 records (55.32%) with missing values.
Variable CREATININ_MIN_1 has 213 records (55.32%) with missing values.
Variable CREATININ_MAX_1 has 213 records (55.32%) with missing values.
Variable CREATININ_DIFF_1 has 213 records (55.32%) with missing values.
Variable FFA_MEDIAN_1 has 213 records (55.32%) with missing values.
Variable FFA_MEAN_1 has 213 records (55.32%) with missing values.
Variable FFA_MIN_1 has 213 records (55.32%) with missing values.
Variable FFA_MAX_1 has 213 records (55.32%) with missing values.
Variable FFA_DIFF_1 has 213 records (55.32%) with missing values.
Variable GGT_MEDIAN_1 has 213 records (55.32%) with missing values.
Variable GGT_MEAN_1 has 213 records (55.32%) with missing values.
Variable GGT_MIN_1 has 213 records (55.32%) with missing values.
Variable GGT_MAX_1 has 213 records (55.32%) with missing values.
Variable GGT_DIFF_1 has 213 records (55.32%) with missing values.
Variable GLUCOSE_MEDIAN_1 has 213 records (55.32%) with missing values.
Variable GLUCOSE_MEAN_1 has 213 records (55.32%) with missing values.
Variable GLUCOSE_MIN_1 has 213 records (55.32%) with missing values.
Variable GLUCOSE_MAX_1 has 213 records (55.32%) with missing values.
Variable GLUCOSE_DIFF_1 has 213 records (55.32%) with missing values.
Variable HEMATOCRITE_MEDIAN_1 has 213 records (55.32%) with missing values.
Variable HEMATOCRITE_MEAN_1 has 213 records (55.32%) with missing values.
Variable HEMATOCRITE_MIN_1 has 213 records (55.32%) with missing values.
Variable HEMATOCRITE_MAX_1 has 213 records (55.32%) with missing values.
Variable HEMATOCRITE_DIFF_1 has 213 records (55.32%) with missing values.
Variable HEMOGLOBIN_MEDIAN_1 has 213 records (55.32%) with missing values.
Variable HEMOGLOBIN_MEAN_1 has 213 records (55.32%) with missing values.
Variable HEMOGLOBIN_MIN_1 has 213 records (55.32%) with missing values.
Variable HEMOGLOBIN_MAX_1 has 213 records (55.32%) with missing values.
Variable HEMOGLOBIN_DIFF_1 has 213 records (55.32%) with missing values.
Variable INR_MEDIAN_1 has 213 records (55.32%) with missing values.
Variable INR_MEAN_1 has 213 records (55.32%) with missing values.
Variable INR_MIN_1 has 213 records (55.32%) with missing values.
Variable INR_MAX_1 has 213 records (55.32%) with missing values.
Variable INR_DIFF_1 has 213 records (55.32%) with missing values.
Variable LACTATE_MEDIAN_1 has 213 records (55.32%) with missing values.
Variable LACTATE_MEAN_1 has 213 records (55.32%) with missing values.
Variable LACTATE_MIN_1 has 213 records (55.32%) with missing values.
Variable LACTATE_MAX_1 has 213 records (55.32%) with missing values.
Variable LACTATE_DIFF_1 has 213 records (55.32%) with missing values.
Variable LEUKOCYTES_MEDIAN_1 has 213 records (55.32%) with missing values.
Variable LEUKOCYTES_MEAN_1 has 213 records (55.32%) with missing values.
Variable LEUKOCYTES_MIN_1 has 213 records (55.32%) with missing values.
Variable LEUKOCYTES_MAX_1 has 213 records (55.32%) with missing values.
Variable LEUKOCYTES_DIFF_1 has 213 records (55.32%) with missing values.
Variable LINFOCITOS_MEDIAN_1 has 213 records (55.32%) with missing values.
Variable LINFOCITOS_MEAN_1 has 213 records (55.32%) with missing values.
Variable LINFOCITOS_MIN_1 has 213 records (55.32%) with missing values.
Variable LINFOCITOS_MAX_1 has 213 records (55.32%) with missing values.
Variable LINFOCITOS_DIFF_1 has 213 records (55.32%) with missing values.
Variable NEUTROPHILES_MEDIAN_1 has 213 records (55.32%) with missing values.
Variable NEUTROPHILES_MEAN_1 has 213 records (55.32%) with missing values.
Variable NEUTROPHILES_MIN_1 has 213 records (55.32%) with missing values.
Variable NEUTROPHILES_MAX_1 has 213 records (55.32%) with missing values.
Variable NEUTROPHILES_DIFF_1 has 213 records (55.32%) with missing values.
Variable P02_ARTERIAL_MEDIAN_1 has 213 records (55.32%) with missing values.
Variable P02_ARTERIAL_MEAN_1 has 213 records (55.32%) with missing values.
Variable P02_ARTERIAL_MIN_1 has 213 records (55.32%) with missing values.
Variable P02_ARTERIAL_MAX_1 has 213 records (55.32%) with missing values.
Variable P02_ARTERIAL_DIFF_1 has 213 records (55.32%) with missing values.
Variable P02_VENOUS_MEDIAN_1 has 213 records (55.32%) with missing values.
Variable P02_VENOUS_MEAN_1 has 213 records (55.32%) with missing values.
Variable P02_VENOUS_MIN_1 has 213 records (55.32%) with missing values.
Variable P02_VENOUS_MAX_1 has 213 records (55.32%) with missing values.
Variable P02_VENOUS_DIFF_1 has 213 records (55.32%) with missing values.
Variable PC02_ARTERIAL_MEDIAN_1 has 213 records (55.32%) with missing values.
Variable PC02_ARTERIAL_MEAN_1 has 213 records (55.32%) with missing values.
Variable PC02_ARTERIAL_MIN_1 has 213 records (55.32%) with missing values.
Variable PC02_ARTERIAL_MAX_1 has 213 records (55.32%) with missing values.
Variable PC02_ARTERIAL_DIFF_1 has 213 records (55.32%) with missing values.
Variable PC02_VENOUS_MEDIAN_1 has 213 records (55.32%) with missing values.
Variable PC02_VENOUS_MEAN_1 has 213 records (55.32%) with missing values.
Variable PC02_VENOUS_MIN_1 has 213 records (55.32%) with missing values.
Variable PC02_VENOUS_MAX_1 has 213 records (55.32%) with missing values.
Variable PC02_VENOUS_DIFF_1 has 213 records (55.32%) with missing values.
Variable PCR_MEDIAN_1 has 213 records (55.32%) with missing values.
Variable PCR_MEAN_1 has 213 records (55.32%) with missing values.
Variable PCR_MIN_1 has 213 records (55.32%) with missing values.
Variable PCR_MAX_1 has 213 records (55.32%) with missing values.
Variable PCR_DIFF_1 has 213 records (55.32%) with missing values.
Variable PH_ARTERIAL_MEDIAN_1 has 213 records (55.32%) with missing values.
Variable PH_ARTERIAL_MEAN_1 has 213 records (55.32%) with missing values.
Variable PH_ARTERIAL_MIN_1 has 213 records (55.32%) with missing values.
Variable PH_ARTERIAL_MAX_1 has 213 records (55.32%) with missing values.
Variable PH_ARTERIAL_DIFF_1 has 213 records (55.32%) with missing values.
Variable PH_VENOUS_MEDIAN_1 has 213 records (55.32%) with missing values.
Variable PH_VENOUS_MEAN_1 has 213 records (55.32%) with missing values.
Variable PH_VENOUS_MIN_1 has 213 records (55.32%) with missing values.
Variable PH_VENOUS_MAX_1 has 213 records (55.32%) with missing values.
Variable PH_VENOUS_DIFF_1 has 213 records (55.32%) with missing values.
Variable PLATELETS_MEDIAN_1 has 213 records (55.32%) with missing values.
Variable PLATELETS_MEAN_1 has 213 records (55.32%) with missing values.
Variable PLATELETS_MIN_1 has 213 records (55.32%) with missing values.
Variable PLATELETS_MAX_1 has 213 records (55.32%) with missing values.
Variable PLATELETS_DIFF_1 has 213 records (55.32%) with missing values.
Variable POTASSIUM_MEDIAN_1 has 213 records (55.32%) with missing values.
Variable POTASSIUM_MEAN_1 has 213 records (55.32%) with missing values.
Variable POTASSIUM_MIN_1 has 213 records (55.32%) with missing values.
Variable POTASSIUM_MAX_1 has 213 records (55.32%) with missing values.
Variable POTASSIUM_DIFF_1 has 213 records (55.32%) with missing values.
Variable SAT02_ARTERIAL_MEDIAN_1 has 213 records (55.32%) with missing values.
Variable SAT02_ARTERIAL_MEAN_1 has 213 records (55.32%) with missing values.
Variable SAT02_ARTERIAL_MIN_1 has 213 records (55.32%) with missing values.
Variable SAT02_ARTERIAL_MAX_1 has 213 records (55.32%) with missing values.
Variable SAT02_ARTERIAL_DIFF_1 has 213 records (55.32%) with missing values.
Variable SAT02_VENOUS_MEDIAN_1 has 213 records (55.32%) with missing values.
Variable SAT02_VENOUS_MEAN_1 has 213 records (55.32%) with missing values.
Variable SAT02_VENOUS_MIN_1 has 213 records (55.32%) with missing values.
Variable SAT02_VENOUS_MAX_1 has 213 records (55.32%) with missing values.
Variable SAT02_VENOUS_DIFF_1 has 213 records (55.32%) with missing values.
Variable SODIUM_MEDIAN_1 has 213 records (55.32%) with missing values.
Variable SODIUM_MEAN_1 has 213 records (55.32%) with missing values.
Variable SODIUM_MIN_1 has 213 records (55.32%) with missing values.
Variable SODIUM_MAX_1 has 213 records (55.32%) with missing values.
Variable SODIUM_DIFF_1 has 213 records (55.32%) with missing values.
Variable TGO_MEDIAN_1 has 213 records (55.32%) with missing values.
Variable TGO_MEAN_1 has 213 records (55.32%) with missing values.
Variable TGO_MIN_1 has 213 records (55.32%) with missing values.
Variable TGO_MAX_1 has 213 records (55.32%) with missing values.
Variable TGO_DIFF_1 has 213 records (55.32%) with missing values.
Variable TGP_MEDIAN_1 has 213 records (55.32%) with missing values.
Variable TGP_MEAN_1 has 213 records (55.32%) with missing values.
Variable TGP_MIN_1 has 213 records (55.32%) with missing values.
Variable TGP_MAX_1 has 213 records (55.32%) with missing values.
Variable TGP_DIFF_1 has 213 records (55.32%) with missing values.
Variable TTPA_MEDIAN_1 has 213 records (55.32%) with missing values.
Variable TTPA_MEAN_1 has 213 records (55.32%) with missing values.
Variable TTPA_MIN_1 has 213 records (55.32%) with missing values.
Variable TTPA_MAX_1 has 213 records (55.32%) with missing values.
Variable TTPA_DIFF_1 has 213 records (55.32%) with missing values.
Variable UREA_MEDIAN_1 has 213 records (55.32%) with missing values.
Variable UREA_MEAN_1 has 213 records (55.32%) with missing values.
Variable UREA_MIN_1 has 213 records (55.32%) with missing values.
Variable UREA_MAX_1 has 213 records (55.32%) with missing values.
Variable UREA_DIFF_1 has 213 records (55.32%) with missing values.
Variable DIMER_MEDIAN_1 has 213 records (55.32%) with missing values.
Variable DIMER_MEAN_1 has 213 records (55.32%) with missing values.
Variable DIMER_MIN_1 has 213 records (55.32%) with missing values.
Variable DIMER_MAX_1 has 213 records (55.32%) with missing values.
Variable DIMER_DIFF_1 has 213 records (55.32%) with missing values.
Variable BLOODPRESSURE_DIASTOLIC_MEAN_1 has 248 records (64.42%) with missing values.
Variable BLOODPRESSURE_SISTOLIC_MEAN_1 has 248 records (64.42%) with missing values.
Variable HEART_RATE_MEAN_1 has 252 records (65.45%) with missing values.
Variable RESPIRATORY_RATE_MEAN_1 has 268 records (69.61%) with missing values.
Variable TEMPERATURE_MEAN_1 has 263 records (68.31%) with missing values.
Variable OXYGEN_SATURATION_MEAN_1 has 256 records (66.49%) with missing values.
Variable BLOODPRESSURE_DIASTOLIC_MEDIAN_1 has 248 records (64.42%) with missing values.
Variable BLOODPRESSURE_SISTOLIC_MEDIAN_1 has 248 records (64.42%) with missing values.
Variable HEART_RATE_MEDIAN_1 has 252 records (65.45%) with missing values.
Variable RESPIRATORY_RATE_MEDIAN_1 has 268 records (69.61%) with missing values.
Variable TEMPERATURE_MEDIAN_1 has 263 records (68.31%) with missing values.
Variable OXYGEN_SATURATION_MEDIAN_1 has 256 records (66.49%) with missing values.
Variable BLOODPRESSURE_DIASTOLIC_MIN_1 has 248 records (64.42%) with missing values.
Variable BLOODPRESSURE_SISTOLIC_MIN_1 has 248 records (64.42%) with missing values.
Variable HEART_RATE_MIN_1 has 252 records (65.45%) with missing values.
Variable RESPIRATORY_RATE_MIN_1 has 268 records (69.61%) with missing values.
Variable TEMPERATURE_MIN_1 has 263 records (68.31%) with missing values.
Variable OXYGEN_SATURATION_MIN_1 has 256 records (66.49%) with missing values.
Variable BLOODPRESSURE_DIASTOLIC_MAX_1 has 248 records (64.42%) with missing values.
Variable BLOODPRESSURE_SISTOLIC_MAX_1 has 248 records (64.42%) with missing values.
Variable HEART_RATE_MAX_1 has 252 records (65.45%) with missing values.
Variable RESPIRATORY_RATE_MAX_1 has 268 records (69.61%) with missing values.
Variable TEMPERATURE_MAX_1 has 263 records (68.31%) with missing values.
Variable OXYGEN_SATURATION_MAX_1 has 256 records (66.49%) with missing values.
Variable BLOODPRESSURE_DIASTOLIC_DIFF_1 has 248 records (64.42%) with missing values.
Variable BLOODPRESSURE_SISTOLIC_DIFF_1 has 248 records (64.42%) with missing values.
Variable HEART_RATE_DIFF_1 has 252 records (65.45%) with missing values.
Variable RESPIRATORY_RATE_DIFF_1 has 268 records (69.61%) with missing values.
Variable TEMPERATURE_DIFF_1 has 263 records (68.31%) with missing values.
Variable OXYGEN_SATURATION_DIFF_1 has 256 records (66.49%) with missing values.
Variable BLOODPRESSURE_DIASTOLIC_DIFF_REL_1 has 248 records (64.42%) with missing values.
Variable BLOODPRESSURE_SISTOLIC_DIFF_REL_1 has 248 records (64.42%) with missing values.
Variable HEART_RATE_DIFF_REL_1 has 252 records (65.45%) with missing values.
Variable RESPIRATORY_RATE_DIFF_REL_1 has 268 records (69.61%) with missing values.
Variable TEMPERATURE_DIFF_REL_1 has 263 records (68.31%) with missing values.
Variable OXYGEN_SATURATION_DIFF_REL_1 has 256 records (66.49%) with missing values.
Variable DISEASE GROUPING 1_2 has 1 records (0.26%) with missing values.
Variable DISEASE GROUPING 2_2 has 1 records (0.26%) with missing values.
Variable DISEASE GROUPING 3_2 has 1 records (0.26%) with missing values.
Variable DISEASE GROUPING 4_2 has 1 records (0.26%) with missing values.
Variable DISEASE GROUPING 5_2 has 1 records (0.26%) with missing values.
Variable DISEASE GROUPING 6_2 has 1 records (0.26%) with missing values.
Variable IMMUNOCOMPROMISED_2 has 1 records (0.26%) with missing values.
Variable OTHER_2 has 1 records (0.26%) with missing values.
Variable ALBUMIN_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable ALBUMIN_MEAN_2 has 208 records (54.03%) with missing values.
Variable ALBUMIN_MIN_2 has 208 records (54.03%) with missing values.
Variable ALBUMIN_MAX_2 has 208 records (54.03%) with missing values.
Variable ALBUMIN_DIFF_2 has 208 records (54.03%) with missing values.
Variable BE_ARTERIAL_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable BE_ARTERIAL_MEAN_2 has 208 records (54.03%) with missing values.
Variable BE_ARTERIAL_MIN_2 has 208 records (54.03%) with missing values.
Variable BE_ARTERIAL_MAX_2 has 208 records (54.03%) with missing values.
Variable BE_ARTERIAL_DIFF_2 has 208 records (54.03%) with missing values.
Variable BE_VENOUS_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable BE_VENOUS_MEAN_2 has 208 records (54.03%) with missing values.
Variable BE_VENOUS_MIN_2 has 208 records (54.03%) with missing values.
Variable BE_VENOUS_MAX_2 has 208 records (54.03%) with missing values.
Variable BE_VENOUS_DIFF_2 has 208 records (54.03%) with missing values.
Variable BIC_ARTERIAL_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable BIC_ARTERIAL_MEAN_2 has 208 records (54.03%) with missing values.
Variable BIC_ARTERIAL_MIN_2 has 208 records (54.03%) with missing values.
Variable BIC_ARTERIAL_MAX_2 has 208 records (54.03%) with missing values.
Variable BIC_ARTERIAL_DIFF_2 has 208 records (54.03%) with missing values.
Variable BIC_VENOUS_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable BIC_VENOUS_MEAN_2 has 208 records (54.03%) with missing values.
Variable BIC_VENOUS_MIN_2 has 208 records (54.03%) with missing values.
Variable BIC_VENOUS_MAX_2 has 208 records (54.03%) with missing values.
Variable BIC_VENOUS_DIFF_2 has 208 records (54.03%) with missing values.
Variable BILLIRUBIN_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable BILLIRUBIN_MEAN_2 has 208 records (54.03%) with missing values.
Variable BILLIRUBIN_MIN_2 has 208 records (54.03%) with missing values.
Variable BILLIRUBIN_MAX_2 has 208 records (54.03%) with missing values.
Variable BILLIRUBIN_DIFF_2 has 208 records (54.03%) with missing values.
Variable BLAST_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable BLAST_MEAN_2 has 208 records (54.03%) with missing values.
Variable BLAST_MIN_2 has 208 records (54.03%) with missing values.
Variable BLAST_MAX_2 has 208 records (54.03%) with missing values.
Variable BLAST_DIFF_2 has 208 records (54.03%) with missing values.
Variable CALCIUM_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable CALCIUM_MEAN_2 has 208 records (54.03%) with missing values.
Variable CALCIUM_MIN_2 has 208 records (54.03%) with missing values.
Variable CALCIUM_MAX_2 has 208 records (54.03%) with missing values.
Variable CALCIUM_DIFF_2 has 208 records (54.03%) with missing values.
Variable CREATININ_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable CREATININ_MEAN_2 has 208 records (54.03%) with missing values.
Variable CREATININ_MIN_2 has 208 records (54.03%) with missing values.
Variable CREATININ_MAX_2 has 208 records (54.03%) with missing values.
Variable CREATININ_DIFF_2 has 208 records (54.03%) with missing values.
Variable FFA_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable FFA_MEAN_2 has 208 records (54.03%) with missing values.
Variable FFA_MIN_2 has 208 records (54.03%) with missing values.
Variable FFA_MAX_2 has 208 records (54.03%) with missing values.
Variable FFA_DIFF_2 has 208 records (54.03%) with missing values.
Variable GGT_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable GGT_MEAN_2 has 208 records (54.03%) with missing values.
Variable GGT_MIN_2 has 208 records (54.03%) with missing values.
Variable GGT_MAX_2 has 208 records (54.03%) with missing values.
Variable GGT_DIFF_2 has 208 records (54.03%) with missing values.
Variable GLUCOSE_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable GLUCOSE_MEAN_2 has 208 records (54.03%) with missing values.
Variable GLUCOSE_MIN_2 has 208 records (54.03%) with missing values.
Variable GLUCOSE_MAX_2 has 208 records (54.03%) with missing values.
Variable GLUCOSE_DIFF_2 has 208 records (54.03%) with missing values.
Variable HEMATOCRITE_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable HEMATOCRITE_MEAN_2 has 208 records (54.03%) with missing values.
Variable HEMATOCRITE_MIN_2 has 208 records (54.03%) with missing values.
Variable HEMATOCRITE_MAX_2 has 208 records (54.03%) with missing values.
Variable HEMATOCRITE_DIFF_2 has 208 records (54.03%) with missing values.
Variable HEMOGLOBIN_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable HEMOGLOBIN_MEAN_2 has 208 records (54.03%) with missing values.
Variable HEMOGLOBIN_MIN_2 has 208 records (54.03%) with missing values.
Variable HEMOGLOBIN_MAX_2 has 208 records (54.03%) with missing values.
Variable HEMOGLOBIN_DIFF_2 has 208 records (54.03%) with missing values.
Variable INR_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable INR_MEAN_2 has 208 records (54.03%) with missing values.
Variable INR_MIN_2 has 208 records (54.03%) with missing values.
Variable INR_MAX_2 has 208 records (54.03%) with missing values.
Variable INR_DIFF_2 has 208 records (54.03%) with missing values.
Variable LACTATE_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable LACTATE_MEAN_2 has 208 records (54.03%) with missing values.
Variable LACTATE_MIN_2 has 208 records (54.03%) with missing values.
Variable LACTATE_MAX_2 has 208 records (54.03%) with missing values.
Variable LACTATE_DIFF_2 has 208 records (54.03%) with missing values.
Variable LEUKOCYTES_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable LEUKOCYTES_MEAN_2 has 208 records (54.03%) with missing values.
Variable LEUKOCYTES_MIN_2 has 208 records (54.03%) with missing values.
Variable LEUKOCYTES_MAX_2 has 208 records (54.03%) with missing values.
Variable LEUKOCYTES_DIFF_2 has 208 records (54.03%) with missing values.
Variable LINFOCITOS_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable LINFOCITOS_MEAN_2 has 208 records (54.03%) with missing values.
Variable LINFOCITOS_MIN_2 has 208 records (54.03%) with missing values.
Variable LINFOCITOS_MAX_2 has 208 records (54.03%) with missing values.
Variable LINFOCITOS_DIFF_2 has 208 records (54.03%) with missing values.
Variable NEUTROPHILES_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable NEUTROPHILES_MEAN_2 has 208 records (54.03%) with missing values.
Variable NEUTROPHILES_MIN_2 has 208 records (54.03%) with missing values.
Variable NEUTROPHILES_MAX_2 has 208 records (54.03%) with missing values.
Variable NEUTROPHILES_DIFF_2 has 208 records (54.03%) with missing values.
Variable P02_ARTERIAL_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable P02_ARTERIAL_MEAN_2 has 208 records (54.03%) with missing values.
Variable P02_ARTERIAL_MIN_2 has 208 records (54.03%) with missing values.
Variable P02_ARTERIAL_MAX_2 has 208 records (54.03%) with missing values.
Variable P02_ARTERIAL_DIFF_2 has 208 records (54.03%) with missing values.
Variable P02_VENOUS_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable P02_VENOUS_MEAN_2 has 208 records (54.03%) with missing values.
Variable P02_VENOUS_MIN_2 has 208 records (54.03%) with missing values.
Variable P02_VENOUS_MAX_2 has 208 records (54.03%) with missing values.
Variable P02_VENOUS_DIFF_2 has 208 records (54.03%) with missing values.
Variable PC02_ARTERIAL_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable PC02_ARTERIAL_MEAN_2 has 208 records (54.03%) with missing values.
Variable PC02_ARTERIAL_MIN_2 has 208 records (54.03%) with missing values.
Variable PC02_ARTERIAL_MAX_2 has 208 records (54.03%) with missing values.
Variable PC02_ARTERIAL_DIFF_2 has 208 records (54.03%) with missing values.
Variable PC02_VENOUS_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable PC02_VENOUS_MEAN_2 has 208 records (54.03%) with missing values.
Variable PC02_VENOUS_MIN_2 has 208 records (54.03%) with missing values.
Variable PC02_VENOUS_MAX_2 has 208 records (54.03%) with missing values.
Variable PC02_VENOUS_DIFF_2 has 208 records (54.03%) with missing values.
Variable PCR_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable PCR_MEAN_2 has 208 records (54.03%) with missing values.
Variable PCR_MIN_2 has 208 records (54.03%) with missing values.
Variable PCR_MAX_2 has 208 records (54.03%) with missing values.
Variable PCR_DIFF_2 has 208 records (54.03%) with missing values.
Variable PH_ARTERIAL_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable PH_ARTERIAL_MEAN_2 has 208 records (54.03%) with missing values.
Variable PH_ARTERIAL_MIN_2 has 208 records (54.03%) with missing values.
Variable PH_ARTERIAL_MAX_2 has 208 records (54.03%) with missing values.
Variable PH_ARTERIAL_DIFF_2 has 208 records (54.03%) with missing values.
Variable PH_VENOUS_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable PH_VENOUS_MEAN_2 has 208 records (54.03%) with missing values.
Variable PH_VENOUS_MIN_2 has 208 records (54.03%) with missing values.
Variable PH_VENOUS_MAX_2 has 208 records (54.03%) with missing values.
Variable PH_VENOUS_DIFF_2 has 208 records (54.03%) with missing values.
Variable PLATELETS_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable PLATELETS_MEAN_2 has 208 records (54.03%) with missing values.
Variable PLATELETS_MIN_2 has 208 records (54.03%) with missing values.
Variable PLATELETS_MAX_2 has 208 records (54.03%) with missing values.
Variable PLATELETS_DIFF_2 has 208 records (54.03%) with missing values.
Variable POTASSIUM_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable POTASSIUM_MEAN_2 has 208 records (54.03%) with missing values.
Variable POTASSIUM_MIN_2 has 208 records (54.03%) with missing values.
Variable POTASSIUM_MAX_2 has 208 records (54.03%) with missing values.
Variable POTASSIUM_DIFF_2 has 208 records (54.03%) with missing values.
Variable SAT02_ARTERIAL_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable SAT02_ARTERIAL_MEAN_2 has 208 records (54.03%) with missing values.
Variable SAT02_ARTERIAL_MIN_2 has 208 records (54.03%) with missing values.
Variable SAT02_ARTERIAL_MAX_2 has 208 records (54.03%) with missing values.
Variable SAT02_ARTERIAL_DIFF_2 has 208 records (54.03%) with missing values.
Variable SAT02_VENOUS_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable SAT02_VENOUS_MEAN_2 has 208 records (54.03%) with missing values.
Variable SAT02_VENOUS_MIN_2 has 208 records (54.03%) with missing values.
Variable SAT02_VENOUS_MAX_2 has 208 records (54.03%) with missing values.
Variable SAT02_VENOUS_DIFF_2 has 208 records (54.03%) with missing values.
Variable SODIUM_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable SODIUM_MEAN_2 has 208 records (54.03%) with missing values.
Variable SODIUM_MIN_2 has 208 records (54.03%) with missing values.
Variable SODIUM_MAX_2 has 208 records (54.03%) with missing values.
Variable SODIUM_DIFF_2 has 208 records (54.03%) with missing values.
Variable TGO_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable TGO_MEAN_2 has 208 records (54.03%) with missing values.
Variable TGO_MIN_2 has 208 records (54.03%) with missing values.
Variable TGO_MAX_2 has 208 records (54.03%) with missing values.
Variable TGO_DIFF_2 has 208 records (54.03%) with missing values.
Variable TGP_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable TGP_MEAN_2 has 208 records (54.03%) with missing values.
Variable TGP_MIN_2 has 208 records (54.03%) with missing values.
Variable TGP_MAX_2 has 208 records (54.03%) with missing values.
Variable TGP_DIFF_2 has 208 records (54.03%) with missing values.
Variable TTPA_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable TTPA_MEAN_2 has 208 records (54.03%) with missing values.
Variable TTPA_MIN_2 has 208 records (54.03%) with missing values.
Variable TTPA_MAX_2 has 208 records (54.03%) with missing values.
Variable TTPA_DIFF_2 has 208 records (54.03%) with missing values.
Variable UREA_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable UREA_MEAN_2 has 208 records (54.03%) with missing values.
Variable UREA_MIN_2 has 208 records (54.03%) with missing values.
Variable UREA_MAX_2 has 208 records (54.03%) with missing values.
Variable UREA_DIFF_2 has 208 records (54.03%) with missing values.
Variable DIMER_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable DIMER_MEAN_2 has 208 records (54.03%) with missing values.
Variable DIMER_MIN_2 has 208 records (54.03%) with missing values.
Variable DIMER_MAX_2 has 208 records (54.03%) with missing values.
Variable DIMER_DIFF_2 has 208 records (54.03%) with missing values.
Variable BLOODPRESSURE_DIASTOLIC_MEAN_2 has 208 records (54.03%) with missing values.
Variable BLOODPRESSURE_SISTOLIC_MEAN_2 has 208 records (54.03%) with missing values.
Variable HEART_RATE_MEAN_2 has 209 records (54.29%) with missing values.
Variable RESPIRATORY_RATE_MEAN_2 has 231 records (60.00%) with missing values.
Variable TEMPERATURE_MEAN_2 has 215 records (55.84%) with missing values.
Variable OXYGEN_SATURATION_MEAN_2 has 212 records (55.06%) with missing values.
Variable BLOODPRESSURE_DIASTOLIC_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable BLOODPRESSURE_SISTOLIC_MEDIAN_2 has 208 records (54.03%) with missing values.
Variable HEART_RATE_MEDIAN_2 has 209 records (54.29%) with missing values.
Variable RESPIRATORY_RATE_MEDIAN_2 has 231 records (60.00%) with missing values.
Variable TEMPERATURE_MEDIAN_2 has 215 records (55.84%) with missing values.
Variable OXYGEN_SATURATION_MEDIAN_2 has 212 records (55.06%) with missing values.
Variable BLOODPRESSURE_DIASTOLIC_MIN_2 has 208 records (54.03%) with missing values.
Variable BLOODPRESSURE_SISTOLIC_MIN_2 has 208 records (54.03%) with missing values.
Variable HEART_RATE_MIN_2 has 209 records (54.29%) with missing values.
Variable RESPIRATORY_RATE_MIN_2 has 231 records (60.00%) with missing values.
Variable TEMPERATURE_MIN_2 has 215 records (55.84%) with missing values.
Variable OXYGEN_SATURATION_MIN_2 has 212 records (55.06%) with missing values.
Variable BLOODPRESSURE_DIASTOLIC_MAX_2 has 208 records (54.03%) with missing values.
Variable BLOODPRESSURE_SISTOLIC_MAX_2 has 208 records (54.03%) with missing values.
Variable HEART_RATE_MAX_2 has 209 records (54.29%) with missing values.
Variable RESPIRATORY_RATE_MAX_2 has 231 records (60.00%) with missing values.
Variable TEMPERATURE_MAX_2 has 215 records (55.84%) with missing values.
Variable OXYGEN_SATURATION_MAX_2 has 212 records (55.06%) with missing values.
Variable BLOODPRESSURE_DIASTOLIC_DIFF_2 has 208 records (54.03%) with missing values.
Variable BLOODPRESSURE_SISTOLIC_DIFF_2 has 208 records (54.03%) with missing values.
Variable HEART_RATE_DIFF_2 has 209 records (54.29%) with missing values.
Variable RESPIRATORY_RATE_DIFF_2 has 231 records (60.00%) with missing values.
Variable TEMPERATURE_DIFF_2 has 215 records (55.84%) with missing values.
Variable OXYGEN_SATURATION_DIFF_2 has 212 records (55.06%) with missing values.
Variable BLOODPRESSURE_DIASTOLIC_DIFF_REL_2 has 208 records (54.03%) with missing values.
Variable BLOODPRESSURE_SISTOLIC_DIFF_REL_2 has 208 records (54.03%) with missing values.
Variable HEART_RATE_DIFF_REL_2 has 209 records (54.29%) with missing values.
Variable RESPIRATORY_RATE_DIFF_REL_2 has 231 records (60.00%) with missing values.
Variable TEMPERATURE_DIFF_REL_2 has 215 records (55.84%) with missing values.
Variable OXYGEN_SATURATION_DIFF_REL_2 has 212 records (55.06%) with missing values.
Variable DISEASE GROUPING 1_3 has 1 record (0.26%) with missing values.
Variable DISEASE GROUPING 2_3 has 1 record (0.26%) with missing values.
Variable DISEASE GROUPING 3_3 has 1 record (0.26%) with missing values.
Variable DISEASE GROUPING 4_3 has 1 record (0.26%) with missing values.
Variable DISEASE GROUPING 5_3 has 1 record (0.26%) with missing values.
Variable DISEASE GROUPING 6_3 has 1 record (0.26%) with missing values.
Variable IMMUNOCOMPROMISED_3 has 1 record (0.26%) with missing values.
Variable OTHER_3 has 1 record (0.26%) with missing values.
Variable ALBUMIN_MEDIAN_3 has 343 records (89.09%) with missing values.
Variable ALBUMIN_MEAN_3 has 343 records (89.09%) with missing values.
Variable ALBUMIN_MIN_3 has 343 records (89.09%) with missing values.
Variable ALBUMIN_MAX_3 has 343 records (89.09%) with missing values.
Variable ALBUMIN_DIFF_3 has 343 records (89.09%) with missing values.
Variable BE_ARTERIAL_MEDIAN_3 has 343 records (89.09%) with missing values.
Variable BE_ARTERIAL_MEAN_3 has 343 records (89.09%) with missing values.
Variable BE_ARTERIAL_MIN_3 has 343 records (89.09%) with missing values.
Variable BE_ARTERIAL_MAX_3 has 343 records (89.09%) with missing values.
Variable BE_ARTERIAL_DIFF_3 has 343 records (89.09%) with missing values.
Variable BE_VENOUS_MEDIAN_3 has 343 records (89.09%) with missing values.
Variable BE_VENOUS_MEAN_3 has 343 records (89.09%) with missing values.
Variable BE_VENOUS_MIN_3 has 343 records (89.09%) with missing values.
Variable BE_VENOUS_MAX_3 has 343 records (89.09%) with missing values.
Variable BE_VENOUS_DIFF_3 has 343 records (89.09%) with missing values.
Variable BIC_ARTERIAL_MEDIAN_3 has 343 records (89.09%) with missing values.
Variable BIC_ARTERIAL_MEAN_3 has 343 records (89.09%) with missing values.
Variable BIC_ARTERIAL_MIN_3 has 343 records (89.09%) with missing values.
Variable BIC_ARTERIAL_MAX_3 has 343 records (89.09%) with missing values.
Variable BIC_ARTERIAL_DIFF_3 has 343 records (89.09%) with missing values.
Variable BIC_VENOUS_MEDIAN_3 has 343 records (89.09%) with missing values.
Variable BIC_VENOUS_MEAN_3 has 343 records (89.09%) with missing values.
Variable BIC_VENOUS_MIN_3 has 343 records (89.09%) with missing values.
Variable BIC_VENOUS_MAX_3 has 343 records (89.09%) with missing values.
Variable BIC_VENOUS_DIFF_3 has 343 records (89.09%) with missing values.
Variable BILLIRUBIN_MEDIAN_3 has 343 records (89.09%) with missing values.
Variable BILLIRUBIN_MEAN_3 has 343 records (89.09%) with missing values.
Variable BILLIRUBIN_MIN_3 has 343 records (89.09%) with missing values.
Variable BILLIRUBIN_MAX_3 has 343 records (89.09%) with missing values.
Variable BILLIRUBIN_DIFF_3 has 343 records (89.09%) with missing values.
Variable BLAST_MEDIAN_3 has 343 records (89.09%) with missing values.
Variable BLAST_MEAN_3 has 343 records (89.09%) with missing values.
Variable BLAST_MIN_3 has 343 records (89.09%) with missing values.
Variable BLAST_MAX_3 has 343 records (89.09%) with missing values.
Variable BLAST_DIFF_3 has 343 records (89.09%) with missing values.
Variable CALCIUM_MEDIAN_3 has 343 records (89.09%) with missing values.
Variable CALCIUM_MEAN_3 has 343 records (89.09%) with missing values.
Variable CALCIUM_MIN_3 has 343 records (89.09%) with missing values.
Variable CALCIUM_MAX_3 has 343 records (89.09%) with missing values.
Variable CALCIUM_DIFF_3 has 343 records (89.09%) with missing values.
Variable CREATININ_MEDIAN_3 has 343 records (89.09%) with missing values.
Variable CREATININ_MEAN_3 has 343 records (89.09%) with missing values.
Variable CREATININ_MIN_3 has 343 records (89.09%) with missing values.
Variable CREATININ_MAX_3 has 343 records (89.09%) with missing values.
Variable CREATININ_DIFF_3 has 343 records (89.09%) with missing values.
Variable FFA_MEDIAN_3 has 343 records (89.09%) with missing values.
Variable FFA_MEAN_3 has 343 records (89.09%) with missing values.
Variable FFA_MIN_3 has 343 records (89.09%) with missing values.
Variable FFA_MAX_3 has 343 records (89.09%) with missing values.
Variable FFA_DIFF_3 has 343 records (89.09%) with missing values.
Variable GGT_MEDIAN_3 has 343 records (89.09%) with missing values.
Variable GGT_MEAN_3 has 343 records (89.09%) with missing values.
Variable GGT_MIN_3 has 343 records (89.09%) with missing values.
Variable GGT_MAX_3 has 343 records (89.09%) with missing values.
Variable GGT_DIFF_3 has 343 records (89.09%) with missing values.
Variable GLUCOSE_MEDIAN_3 has 343 records (89.09%) with missing values.
Variable GLUCOSE_MEAN_3 has 343 records (89.09%) with missing values.
Variable GLUCOSE_MIN_3 has 343 records (89.09%) with missing values.
Variable GLUCOSE_MAX_3 has 343 records (89.09%) with missing values.
Variable GLUCOSE_DIFF_3 has 343 records (89.09%) with missing values.
Variable HEMATOCRITE_MEDIAN_3 has 343 records (89.09%) with missing values.
Variable HEMATOCRITE_MEAN_3 has 343 records (89.09%) with missing values.
Variable HEMATOCRITE_MIN_3 has 343 records (89.09%) with missing values.
Variable HEMATOCRITE_MAX_3 has 343 records (89.09%) with missing values.
Variable HEMATOCRITE_DIFF_3 has 343 records (89.09%) with missing values.
Variable HEMOGLOBIN_MEDIAN_3 has 343 records (89.09%) with missing values.
Variable HEMOGLOBIN_MEAN_3 has 343 records (89.09%) with missing values.
Variable HEMOGLOBIN_MIN_3 has 343 records (89.09%) with missing values.
Variable HEMOGLOBIN_MAX_3 has 343 records (89.09%) with missing values.
Variable HEMOGLOBIN_DIFF_3 has 343 records (89.09%) with missing values.
Variable INR_MEDIAN_3 has 343 records (89.09%) with missing values.
Variable INR_MEAN_3 has 343 records (89.09%) with missing values.
Variable INR_MIN_3 has 343 records (89.09%) with missing values.
Variable INR_MAX_3 has 343 records (89.09%) with missing values.
Variable INR_DIFF_3 has 343 records (89.09%) with missing values.
Variable LACTATE_MEDIAN_3 has 343 records (89.09%) with missing values.
Variable LACTATE_MEAN_3 has 343 records (89.09%) with missing values.
Variable LACTATE_MIN_3 has 343 records (89.09%) with missing values.
Variable LACTATE_MAX_3 has 343 records (89.09%) with missing values.
Variable LACTATE_DIFF_3 has 343 records (89.09%) with missing values.
Variable LEUKOCYTES_MEDIAN_3 has 343 records (89.09%) with missing values.
Variable LEUKOCYTES_MEAN_3 has 343 records (89.09%) with missing values.
Variable LEUKOCYTES_MIN_3 has 343 records (89.09%) with missing values.
Variable LEUKOCYTES_MAX_3 has 343 records (89.09%) with missing values.
Variable LEUKOCYTES_DIFF_3 has 343 records (89.09%) with missing values.
Variable LINFOCITOS_MEDIAN_3 has 343 records (89.09%) with missing values.
Variable LINFOCITOS_MEAN_3 has 343 records (89.09%) with missing values.
Variable LINFOCITOS_MIN_3 has 343 records (89.09%) with missing values.
Variable LINFOCITOS_MAX_3 has 343 records (89.09%) with missing values.
Variable LINFOCITOS_DIFF_3 has 343 records (89.09%) with missing values.
Variable NEUTROPHILES_MEDIAN_3 has 343 records (89.09%) with missing values.
Variable NEUTROPHILES_MEAN_3 has 343 records (89.09%) with missing values.
Variable NEUTROPHILES_MIN_3 has 343 records (89.09%) with missing values.
Variable NEUTROPHILES_MAX_3 has 343 records (89.09%) with missing values.
Variable NEUTROPHILES_DIFF_3 has 343 records (89.09%) with missing values.
Variable P02_ARTERIAL_MEDIAN_3 has 343 records (89.09%) with missing values.
Variable P02_ARTERIAL_MEAN_3 has 343 records (89.09%) with missing values.
Variable P02_ARTERIAL_MIN_3 has 343 records (89.09%) with missing values.
Variable P02_ARTERIAL_MAX_3 has 343 records (89.09%) with missing values.
Variable P02_ARTERIAL_DIFF_3 has 343 records (89.09%) with missing values.
Variable P02_VENOUS_MEDIAN_3 has 343 records (89.09%) with missing values.
Variable P02_VENOUS_MEAN_3 has 343 records (89.09%) with missing values.
Variable P02_VENOUS_MIN_3 has 343 records (89.09%) with missing values.
Variable P02_VENOUS_MAX_3 has 343 records (89.09%) with missing values.
Variable P02_VENOUS_DIFF_3 has 343 records (89.09%) with missing values.
Variable PC02_ARTERIAL_MEDIAN_3 has 343 records (89.09%) with missing values.
Variable PC02_ARTERIAL_MEAN_3 has 343 records (89.09%) with missing values.
Variable PC02_ARTERIAL_MIN_3 has 343 records (89.09%) with missing values.
Variable PC02_ARTERIAL_MAX_3 has 343 records (89.09%) with missing values.
Variable PC02_ARTERIAL_DIFF_3 has 343 records (89.09%) with missing values.
Variable PC02_VENOUS_MEDIAN_3 has 343 records (89.09%) with missing values.
Variable PC02_VENOUS_MEAN_3 has 343 records (89.09%) with missing values.
Variable PC02_VENOUS_MIN_3 has 343 records (89.09%) with missing values.
Variable PC02_VENOUS_MAX_3 has 343 records (89.09%) with missing values.
Variable PC02_VENOUS_DIFF_3 has 343 records (89.09%) with missing values.
Variable PCR_MEDIAN_3 has 343 records (89.09%) with missing values.
Variable PCR_MEAN_3 has 343 records (89.09%) with missing values.
Variable PCR_MIN_3 has 343 records (89.09%) with missing values.
Variable PCR_MAX_3 has 343 records (89.09%) with missing values.
Variable PCR_DIFF_3 has 343 records (89.09%) with missing values.
Variable PH_ARTERIAL_MEDIAN_3 has 343 records (89.09%) with missing values.
Variable PH_ARTERIAL_MEAN_3 has 343 records (89.09%) with missing values.
Variable PH_ARTERIAL_MIN_3 has 343 records (89.09%) with missing values.
Variable PH_ARTERIAL_MAX_3 has 343 records (89.09%) with missing values.
Variable PH_ARTERIAL_DIFF_3 has 343 records (89.09%) with missing values.
Variable PH_VENOUS_MEDIAN_3 has 343 records (89.09%) with missing values.
Variable PH_VENOUS_MEAN_3 has 343 records (89.09%) with missing values.
Variable PH_VENOUS_MIN_3 has 343 records (89.09%) with missing values.
Variable PH_VENOUS_MAX_3 has 343 records (89.09%) with missing values.
Variable PH_VENOUS_DIFF_3 has 343 records (89.09%) with missing values.
Variable PLATELETS_MEDIAN_3 has 343 records (89.09%) with missing values.
Variable PLATELETS_MEAN_3 has 343 records (89.09%) with missing values.
Variable PLATELETS_MIN_3 has 343 records (89.09%) with missing values.
Variable PLATELETS_MAX_3 has 343 records (89.09%) with missing values.
Variable PLATELETS_DIFF_3 has 343 records (89.09%) with missing values.
Variable POTASSIUM_MEDIAN_3 has 343 records (89.09%) with missing values.
Variable POTASSIUM_MEAN_3 has 343 records (89.09%) with missing values.
Variable POTASSIUM_MIN_3 has 343 records (89.09%) with missing values.
Variable POTASSIUM_MAX_3 has 343 records (89.09%) with missing values.
Variable POTASSIUM_DIFF_3 has 343 records (89.09%) with missing values.
Variable SAT02_ARTERIAL_MEDIAN_3 has 343 records (89.09%) with missing values.
Variable SAT02_ARTERIAL_MEAN_3 has 343 records (89.09%) with missing values.
Variable SAT02_ARTERIAL_MIN_3 has 343 records (89.09%) with missing values.
Variable SAT02_ARTERIAL_MAX_3 has 343 records (89.09%) with missing values.
Variable SAT02_ARTERIAL_DIFF_3 has 343 records (89.09%) with missing values.
Variable SAT02_VENOUS_MEDIAN_3 has 343 records (89.09%) with missing values.
Variable SAT02_VENOUS_MEAN_3 has 343 records (89.09%) with missing values.
Variable SAT02_VENOUS_MIN_3 has 343 records (89.09%) with missing values.
Variable SAT02_VENOUS_MAX_3 has 343 records (89.09%) with missing values.
Variable SAT02_VENOUS_DIFF_3 has 343 records (89.09%) with missing values.
Variable SODIUM_MEDIAN_3 has 343 records (89.09%) with missing values.
Variable SODIUM_MEAN_3 has 343 records (89.09%) with missing values.
Variable SODIUM_MIN_3 has 343 records (89.09%) with missing values.
Variable SODIUM_MAX_3 has 343 records (89.09%) with missing values.
Variable SODIUM_DIFF_3 has 343 records (89.09%) with missing values.
Variable TGO_MEDIAN_3 has 343 records (89.09%) with missing values.
Variable TGO_MEAN_3 has 343 records (89.09%) with missing values.
Variable TGO_MIN_3 has 343 records (89.09%) with missing values.
Variable TGO_MAX_3 has 343 records (89.09%) with missing values.
Variable TGO_DIFF_3 has 343 records (89.09%) with missing values.
Variable TGP_MEDIAN_3 has 343 records (89.09%) with missing values.
Variable TGP_MEAN_3 has 343 records (89.09%) with missing values.
Variable TGP_MIN_3 has 343 records (89.09%) with missing values.
Variable TGP_MAX_3 has 343 records (89.09%) with missing values.
Variable TGP_DIFF_3 has 343 records (89.09%) with missing values.
Variable TTPA_MEDIAN_3 has 343 records (89.09%) with missing values.
Variable TTPA_MEAN_3 has 343 records (89.09%) with missing values.
Variable TTPA_MIN_3 has 343 records (89.09%) with missing values.
Variable TTPA_MAX_3 has 343 records (89.09%) with missing values.
Variable TTPA_DIFF_3 has 343 records (89.09%) with missing values.
Variable UREA_MEDIAN_3 has 343 records (89.09%) with missing values.
Variable UREA_MEAN_3 has 343 records (89.09%) with missing values.
Variable UREA_MIN_3 has 343 records (89.09%) with missing values.
Variable UREA_MAX_3 has 343 records (89.09%) with missing values.
Variable UREA_DIFF_3 has 343 records (89.09%) with missing values.
Variable DIMER_MEDIAN_3 has 343 records (89.09%) with missing values.
Variable DIMER_MEAN_3 has 343 records (89.09%) with missing values.
Variable DIMER_MIN_3 has 343 records (89.09%) with missing values.
Variable DIMER_MAX_3 has 343 records (89.09%) with missing values.
Variable DIMER_DIFF_3 has 343 records (89.09%) with missing values.
Variable BLOODPRESSURE_DIASTOLIC_MEAN_3 has 168 records (43.64%) with missing values.
Variable BLOODPRESSURE_SISTOLIC_MEAN_3 has 168 records (43.64%) with missing values.
Variable HEART_RATE_MEAN_3 has 169 records (43.90%) with missing values.
Variable RESPIRATORY_RATE_MEAN_3 has 185 records (48.05%) with missing values.
Variable TEMPERATURE_MEAN_3 has 169 records (43.90%) with missing values.
Variable OXYGEN_SATURATION_MEAN_3 has 171 records (44.42%) with missing values.
Variable BLOODPRESSURE_DIASTOLIC_MEDIAN_3 has 168 records (43.64%) with missing values.
Variable BLOODPRESSURE_SISTOLIC_MEDIAN_3 has 168 records (43.64%) with missing values.
Variable HEART_RATE_MEDIAN_3 has 169 records (43.90%) with missing values.
Variable RESPIRATORY_RATE_MEDIAN_3 has 185 records (48.05%) with missing values.
Variable TEMPERATURE_MEDIAN_3 has 169 records (43.90%) with missing values.
Variable OXYGEN_SATURATION_MEDIAN_3 has 171 records (44.42%) with missing values.
Variable BLOODPRESSURE_DIASTOLIC_MIN_3 has 168 records (43.64%) with missing values.
Variable BLOODPRESSURE_SISTOLIC_MIN_3 has 168 records (43.64%) with missing values.
Variable HEART_RATE_MIN_3 has 169 records (43.90%) with missing values.
Variable RESPIRATORY_RATE_MIN_3 has 185 records (48.05%) with missing values.
Variable TEMPERATURE_MIN_3 has 169 records (43.90%) with missing values.
Variable OXYGEN_SATURATION_MIN_3 has 171 records (44.42%) with missing values.
Variable BLOODPRESSURE_DIASTOLIC_MAX_3 has 168 records (43.64%) with missing values.
Variable BLOODPRESSURE_SISTOLIC_MAX_3 has 168 records (43.64%) with missing values.
Variable HEART_RATE_MAX_3 has 169 records (43.90%) with missing values.
Variable RESPIRATORY_RATE_MAX_3 has 185 records (48.05%) with missing values.
Variable TEMPERATURE_MAX_3 has 169 records (43.90%) with missing values.
Variable OXYGEN_SATURATION_MAX_3 has 171 records (44.42%) with missing values.
Variable BLOODPRESSURE_DIASTOLIC_DIFF_3 has 168 records (43.64%) with missing values.
Variable BLOODPRESSURE_SISTOLIC_DIFF_3 has 168 records (43.64%) with missing values.
Variable HEART_RATE_DIFF_3 has 169 records (43.90%) with missing values.
Variable RESPIRATORY_RATE_DIFF_3 has 185 records (48.05%) with missing values.
Variable TEMPERATURE_DIFF_3 has 169 records (43.90%) with missing values.
Variable OXYGEN_SATURATION_DIFF_3 has 171 records (44.42%) with missing values.
Variable BLOODPRESSURE_DIASTOLIC_DIFF_REL_3 has 168 records (43.64%) with missing values.
Variable BLOODPRESSURE_SISTOLIC_DIFF_REL_3 has 168 records (43.64%) with missing values.
Variable HEART_RATE_DIFF_REL_3 has 169 records (43.90%) with missing values.
Variable RESPIRATORY_RATE_DIFF_REL_3 has 185 records (48.05%) with missing values.
Variable TEMPERATURE_DIFF_REL_3 has 169 records (43.90%) with missing values.
Variable OXYGEN_SATURATION_DIFF_REL_3 has 171 records (44.42%) with missing values.
Variable DISEASE GROUPING 1_4 has 1 record (0.26%) with missing values.
Variable DISEASE GROUPING 2_4 has 1 record (0.26%) with missing values.
Variable DISEASE GROUPING 3_4 has 1 record (0.26%) with missing values.
Variable DISEASE GROUPING 4_4 has 1 record (0.26%) with missing values.
Variable DISEASE GROUPING 5_4 has 1 record (0.26%) with missing values.
Variable DISEASE GROUPING 6_4 has 1 record (0.26%) with missing values.
Variable IMMUNOCOMPROMISED_4 has 1 record (0.26%) with missing values.
Variable OTHER_4 has 1 record (0.26%) with missing values.
Variable ALBUMIN_MEDIAN_4 has 329 records (85.45%) with missing values.
Variable ALBUMIN_MEAN_4 has 329 records (85.45%) with missing values.
Variable ALBUMIN_MIN_4 has 329 records (85.45%) with missing values.
Variable ALBUMIN_MAX_4 has 329 records (85.45%) with missing values.
Variable ALBUMIN_DIFF_4 has 329 records (85.45%) with missing values.
Variable BE_ARTERIAL_MEDIAN_4 has 329 records (85.45%) with missing values.
Variable BE_ARTERIAL_MEAN_4 has 329 records (85.45%) with missing values.
Variable BE_ARTERIAL_MIN_4 has 329 records (85.45%) with missing values.
Variable BE_ARTERIAL_MAX_4 has 329 records (85.45%) with missing values.
Variable BE_ARTERIAL_DIFF_4 has 329 records (85.45%) with missing values.
Variable BE_VENOUS_MEDIAN_4 has 329 records (85.45%) with missing values.
Variable BE_VENOUS_MEAN_4 has 329 records (85.45%) with missing values.
Variable BE_VENOUS_MIN_4 has 329 records (85.45%) with missing values.
Variable BE_VENOUS_MAX_4 has 329 records (85.45%) with missing values.
Variable BE_VENOUS_DIFF_4 has 329 records (85.45%) with missing values.
Variable BIC_ARTERIAL_MEDIAN_4 has 329 records (85.45%) with missing values.
Variable BIC_ARTERIAL_MEAN_4 has 329 records (85.45%) with missing values.
Variable BIC_ARTERIAL_MIN_4 has 329 records (85.45%) with missing values.
Variable BIC_ARTERIAL_MAX_4 has 329 records (85.45%) with missing values.
Variable BIC_ARTERIAL_DIFF_4 has 329 records (85.45%) with missing values.
Variable BIC_VENOUS_MEDIAN_4 has 329 records (85.45%) with missing values.
Variable BIC_VENOUS_MEAN_4 has 329 records (85.45%) with missing values.
Variable BIC_VENOUS_MIN_4 has 329 records (85.45%) with missing values.
Variable BIC_VENOUS_MAX_4 has 329 records (85.45%) with missing values.
Variable BIC_VENOUS_DIFF_4 has 329 records (85.45%) with missing values.
Variable BILLIRUBIN_MEDIAN_4 has 329 records (85.45%) with missing values.
Variable BILLIRUBIN_MEAN_4 has 329 records (85.45%) with missing values.
Variable BILLIRUBIN_MIN_4 has 329 records (85.45%) with missing values.
Variable BILLIRUBIN_MAX_4 has 329 records (85.45%) with missing values.
Variable BILLIRUBIN_DIFF_4 has 329 records (85.45%) with missing values.
Variable BLAST_MEDIAN_4 has 329 records (85.45%) with missing values.
Variable BLAST_MEAN_4 has 329 records (85.45%) with missing values.
Variable BLAST_MIN_4 has 329 records (85.45%) with missing values.
Variable BLAST_MAX_4 has 329 records (85.45%) with missing values.
Variable BLAST_DIFF_4 has 329 records (85.45%) with missing values.
Variable CALCIUM_MEDIAN_4 has 329 records (85.45%) with missing values.
Variable CALCIUM_MEAN_4 has 329 records (85.45%) with missing values.
Variable CALCIUM_MIN_4 has 329 records (85.45%) with missing values.
Variable CALCIUM_MAX_4 has 329 records (85.45%) with missing values.
Variable CALCIUM_DIFF_4 has 329 records (85.45%) with missing values.
Variable CREATININ_MEDIAN_4 has 329 records (85.45%) with missing values.
Variable CREATININ_MEAN_4 has 329 records (85.45%) with missing values.
Variable CREATININ_MIN_4 has 329 records (85.45%) with missing values.
Variable CREATININ_MAX_4 has 329 records (85.45%) with missing values.
Variable CREATININ_DIFF_4 has 329 records (85.45%) with missing values.
Variable FFA_MEDIAN_4 has 329 records (85.45%) with missing values.
Variable FFA_MEAN_4 has 329 records (85.45%) with missing values.
Variable FFA_MIN_4 has 329 records (85.45%) with missing values.
Variable FFA_MAX_4 has 329 records (85.45%) with missing values.
Variable FFA_DIFF_4 has 329 records (85.45%) with missing values.
Variable GGT_MEDIAN_4 has 329 records (85.45%) with missing values.
Variable GGT_MEAN_4 has 329 records (85.45%) with missing values.
Variable GGT_MIN_4 has 329 records (85.45%) with missing values.
Variable GGT_MAX_4 has 329 records (85.45%) with missing values.
Variable GGT_DIFF_4 has 329 records (85.45%) with missing values.
Variable GLUCOSE_MEDIAN_4 has 329 records (85.45%) with missing values.
Variable GLUCOSE_MEAN_4 has 329 records (85.45%) with missing values.
Variable GLUCOSE_MIN_4 has 329 records (85.45%) with missing values.
Variable GLUCOSE_MAX_4 has 329 records (85.45%) with missing values.
Variable GLUCOSE_DIFF_4 has 329 records (85.45%) with missing values.
Variable HEMATOCRITE_MEDIAN_4 has 329 records (85.45%) with missing values.
Variable HEMATOCRITE_MEAN_4 has 329 records (85.45%) with missing values.
Variable HEMATOCRITE_MIN_4 has 329 records (85.45%) with missing values.
Variable HEMATOCRITE_MAX_4 has 329 records (85.45%) with missing values.
Variable HEMATOCRITE_DIFF_4 has 329 records (85.45%) with missing values.
Variable HEMOGLOBIN_MEDIAN_4 has 329 records (85.45%) with missing values.
Variable HEMOGLOBIN_MEAN_4 has 329 records (85.45%) with missing values.
Variable HEMOGLOBIN_MIN_4 has 329 records (85.45%) with missing values.
Variable HEMOGLOBIN_MAX_4 has 329 records (85.45%) with missing values.
Variable HEMOGLOBIN_DIFF_4 has 329 records (85.45%) with missing values.
Variable INR_MEDIAN_4 has 329 records (85.45%) with missing values.
Variable INR_MEAN_4 has 329 records (85.45%) with missing values.
Variable INR_MIN_4 has 329 records (85.45%) with missing values.
Variable INR_MAX_4 has 329 records (85.45%) with missing values.
Variable INR_DIFF_4 has 329 records (85.45%) with missing values.
Variable LACTATE_MEDIAN_4 has 329 records (85.45%) with missing values.
Variable LACTATE_MEAN_4 has 329 records (85.45%) with missing values.
Variable LACTATE_MIN_4 has 329 records (85.45%) with missing values.
Variable LACTATE_MAX_4 has 329 records (85.45%) with missing values.
Variable LACTATE_DIFF_4 has 329 records (85.45%) with missing values.
Variable LEUKOCYTES_MEDIAN_4 has 329 records (85.45%) with missing values.
Variable LEUKOCYTES_MEAN_4 has 329 records (85.45%) with missing values.
Variable LEUKOCYTES_MIN_4 has 329 records (85.45%) with missing values.
Variable LEUKOCYTES_MAX_4 has 329 records (85.45%) with missing values.
Variable LEUKOCYTES_DIFF_4 has 329 records (85.45%) with missing values.
Variable LINFOCITOS_MEDIAN_4 has 329 records (85.45%) with missing values.
Variable LINFOCITOS_MEAN_4 has 329 records (85.45%) with missing values.
Variable LINFOCITOS_MIN_4 has 329 records (85.45%) with missing values.
Variable LINFOCITOS_MAX_4 has 329 records (85.45%) with missing values.
Variable LINFOCITOS_DIFF_4 has 329 records (85.45%) with missing values.
Variable NEUTROPHILES_MEDIAN_4 has 329 records (85.45%) with missing values.
Variable NEUTROPHILES_MEAN_4 has 329 records (85.45%) with missing values.
Variable NEUTROPHILES_MIN_4 has 329 records (85.45%) with missing values.
Variable NEUTROPHILES_MAX_4 has 329 records (85.45%) with missing values.
Variable NEUTROPHILES_DIFF_4 has 329 records (85.45%) with missing values.
Variable P02_ARTERIAL_MEDIAN_4 has 329 records (85.45%) with missing values.
Variable P02_ARTERIAL_MEAN_4 has 329 records (85.45%) with missing values.
Variable P02_ARTERIAL_MIN_4 has 329 records (85.45%) with missing values.
Variable P02_ARTERIAL_MAX_4 has 329 records (85.45%) with missing values.
Variable P02_ARTERIAL_DIFF_4 has 329 records (85.45%) with missing values.
Variable P02_VENOUS_MEDIAN_4 has 329 records (85.45%) with missing values.
Variable P02_VENOUS_MEAN_4 has 329 records (85.45%) with missing values.
Variable P02_VENOUS_MIN_4 has 329 records (85.45%) with missing values.
Variable P02_VENOUS_MAX_4 has 329 records (85.45%) with missing values.
Variable P02_VENOUS_DIFF_4 has 329 records (85.45%) with missing values.
Variable PC02_ARTERIAL_MEDIAN_4 has 329 records (85.45%) with missing values.
Variable PC02_ARTERIAL_MEAN_4 has 329 records (85.45%) with missing values.
Variable PC02_ARTERIAL_MIN_4 has 329 records (85.45%) with missing values.
Variable PC02_ARTERIAL_MAX_4 has 329 records (85.45%) with missing values.
Variable PC02_ARTERIAL_DIFF_4 has 329 records (85.45%) with missing values.
Variable PC02_VENOUS_MEDIAN_4 has 329 records (85.45%) with missing values.
Variable PC02_VENOUS_MEAN_4 has 329 records (85.45%) with missing values.
Variable PC02_VENOUS_MIN_4 has 329 records (85.45%) with missing values.
Variable PC02_VENOUS_MAX_4 has 329 records (85.45%) with missing values.
Variable PC02_VENOUS_DIFF_4 has 329 records (85.45%) with missing values.
Variable PCR_MEDIAN_4 has 329 records (85.45%) with missing values.
Variable PCR_MEAN_4 has 329 records (85.45%) with missing values.
Variable PCR_MIN_4 has 329 records (85.45%) with missing values.
Variable PCR_MAX_4 has 329 records (85.45%) with missing values.
Variable PCR_DIFF_4 has 329 records (85.45%) with missing values.
Variable PH_ARTERIAL_MEDIAN_4 has 329 records (85.45%) with missing values.
Variable PH_ARTERIAL_MEAN_4 has 329 records (85.45%) with missing values.
Variable PH_ARTERIAL_MIN_4 has 329 records (85.45%) with missing values.
Variable PH_ARTERIAL_MAX_4 has 329 records (85.45%) with missing values.
Variable PH_ARTERIAL_DIFF_4 has 329 records (85.45%) with missing values.
Variable PH_VENOUS_MEDIAN_4 has 329 records (85.45%) with missing values.
Variable PH_VENOUS_MEAN_4 has 329 records (85.45%) with missing values.
Variable PH_VENOUS_MIN_4 has 329 records (85.45%) with missing values.
Variable PH_VENOUS_MAX_4 has 329 records (85.45%) with missing values.
Variable PH_VENOUS_DIFF_4 has 329 records (85.45%) with missing values.
Variable PLATELETS_MEDIAN_4 has 329 records (85.45%) with missing values.
Variable PLATELETS_MEAN_4 has 329 records (85.45%) with missing values.
Variable PLATELETS_MIN_4 has 329 records (85.45%) with missing values.
Variable PLATELETS_MAX_4 has 329 records (85.45%) with missing values.
Variable PLATELETS_DIFF_4 has 329 records (85.45%) with missing values.
Variable POTASSIUM_MEDIAN_4 has 329 records (85.45%) with missing values.
Variable POTASSIUM_MEAN_4 has 329 records (85.45%) with missing values.
Variable POTASSIUM_MIN_4 has 329 records (85.45%) with missing values.
Variable POTASSIUM_MAX_4 has 329 records (85.45%) with missing values.
Variable POTASSIUM_DIFF_4 has 329 records (85.45%) with missing values.
Variable SAT02_ARTERIAL_MEDIAN_4 has 329 records (85.45%) with missing values.
Variable SAT02_ARTERIAL_MEAN_4 has 329 records (85.45%) with missing values.
Variable SAT02_ARTERIAL_MIN_4 has 329 records (85.45%) with missing values.
Variable SAT02_ARTERIAL_MAX_4 has 329 records (85.45%) with missing values.
Variable SAT02_ARTERIAL_DIFF_4 has 329 records (85.45%) with missing values.
Variable SAT02_VENOUS_MEDIAN_4 has 329 records (85.45%) with missing values.
Variable SAT02_VENOUS_MEAN_4 has 329 records (85.45%) with missing values.
Variable SAT02_VENOUS_MIN_4 has 329 records (85.45%) with missing values.
Variable SAT02_VENOUS_MAX_4 has 329 records (85.45%) with missing values.
Variable SAT02_VENOUS_DIFF_4 has 329 records (85.45%) with missing values.
Variable SODIUM_MEDIAN_4 has 329 records (85.45%) with missing values.
Variable SODIUM_MEAN_4 has 329 records (85.45%) with missing values.
Variable SODIUM_MIN_4 has 329 records (85.45%) with missing values.
Variable SODIUM_MAX_4 has 329 records (85.45%) with missing values.
Variable SODIUM_DIFF_4 has 329 records (85.45%) with missing values.
Variable TGO_MEDIAN_4 has 329 records (85.45%) with missing values.
Variable TGO_MEAN_4 has 329 records (85.45%) with missing values.
Variable TGO_MIN_4 has 329 records (85.45%) with missing values.
Variable TGO_MAX_4 has 329 records (85.45%) with missing values.
Variable TGO_DIFF_4 has 329 records (85.45%) with missing values.
Variable TGP_MEDIAN_4 has 329 records (85.45%) with missing values.
Variable TGP_MEAN_4 has 329 records (85.45%) with missing values.
Variable TGP_MIN_4 has 329 records (85.45%) with missing values.
Variable TGP_MAX_4 has 329 records (85.45%) with missing values.
Variable TGP_DIFF_4 has 329 records (85.45%) with missing values.
Variable TTPA_MEDIAN_4 has 329 records (85.45%) with missing values.
Variable TTPA_MEAN_4 has 329 records (85.45%) with missing values.
Variable TTPA_MIN_4 has 329 records (85.45%) with missing values.
Variable TTPA_MAX_4 has 329 records (85.45%) with missing values.
Variable TTPA_DIFF_4 has 329 records (85.45%) with missing values.
Variable UREA_MEDIAN_4 has 329 records (85.45%) with missing values.
Variable UREA_MEAN_4 has 329 records (85.45%) with missing values.
Variable UREA_MIN_4 has 329 records (85.45%) with missing values.
Variable UREA_MAX_4 has 329 records (85.45%) with missing values.
Variable UREA_DIFF_4 has 329 records (85.45%) with missing values.
Variable DIMER_MEDIAN_4 has 329 records (85.45%) with missing values.
Variable DIMER_MEAN_4 has 329 records (85.45%) with missing values.
Variable DIMER_MIN_4 has 329 records (85.45%) with missing values.
Variable DIMER_MAX_4 has 329 records (85.45%) with missing values.
Variable DIMER_DIFF_4 has 329 records (85.45%) with missing values.
Variable BLOODPRESSURE_DIASTOLIC_MEAN_4 has 60 records (15.58%) with missing values.
Variable BLOODPRESSURE_SISTOLIC_MEAN_4 has 60 records (15.58%) with missing values.
Variable HEART_RATE_MEAN_4 has 54 records (14.03%) with missing values.
Variable RESPIRATORY_RATE_MEAN_4 has 63 records (16.36%) with missing values.
Variable TEMPERATURE_MEAN_4 has 46 records (11.95%) with missing values.
Variable OXYGEN_SATURATION_MEAN_4 has 46 records (11.95%) with missing values.
Variable BLOODPRESSURE_DIASTOLIC_MEDIAN_4 has 60 records (15.58%) with missing values.
Variable BLOODPRESSURE_SISTOLIC_MEDIAN_4 has 60 records (15.58%) with missing values.
Variable HEART_RATE_MEDIAN_4 has 54 records (14.03%) with missing values.
Variable RESPIRATORY_RATE_MEDIAN_4 has 63 records (16.36%) with missing values.
Variable TEMPERATURE_MEDIAN_4 has 46 records (11.95%) with missing values.
Variable OXYGEN_SATURATION_MEDIAN_4 has 46 records (11.95%) with missing values.
Variable BLOODPRESSURE_DIASTOLIC_MIN_4 has 60 records (15.58%) with missing values.
Variable BLOODPRESSURE_SISTOLIC_MIN_4 has 60 records (15.58%) with missing values.
Variable HEART_RATE_MIN_4 has 54 records (14.03%) with missing values.
Variable RESPIRATORY_RATE_MIN_4 has 63 records (16.36%) with missing values.
Variable TEMPERATURE_MIN_4 has 46 records (11.95%) with missing values.
Variable OXYGEN_SATURATION_MIN_4 has 46 records (11.95%) with missing values.
Variable BLOODPRESSURE_DIASTOLIC_MAX_4 has 60 records (15.58%) with missing values.
Variable BLOODPRESSURE_SISTOLIC_MAX_4 has 60 records (15.58%) with missing values.
Variable HEART_RATE_MAX_4 has 54 records (14.03%) with missing values.
Variable RESPIRATORY_RATE_MAX_4 has 63 records (16.36%) with missing values.
Variable TEMPERATURE_MAX_4 has 46 records (11.95%) with missing values.
Variable OXYGEN_SATURATION_MAX_4 has 46 records (11.95%) with missing values.
Variable BLOODPRESSURE_DIASTOLIC_DIFF_4 has 60 records (15.58%) with missing values.
Variable BLOODPRESSURE_SISTOLIC_DIFF_4 has 60 records (15.58%) with missing values.
Variable HEART_RATE_DIFF_4 has 54 records (14.03%) with missing values.
Variable RESPIRATORY_RATE_DIFF_4 has 63 records (16.36%) with missing values.
Variable TEMPERATURE_DIFF_4 has 46 records (11.95%) with missing values.
Variable OXYGEN_SATURATION_DIFF_4 has 46 records (11.95%) with missing values.
Variable BLOODPRESSURE_DIASTOLIC_DIFF_REL_4 has 60 records (15.58%) with missing values.
Variable BLOODPRESSURE_SISTOLIC_DIFF_REL_4 has 60 records (15.58%) with missing values.
Variable HEART_RATE_DIFF_REL_4 has 54 records (14.03%) with missing values.
Variable RESPIRATORY_RATE_DIFF_REL_4 has 63 records (16.36%) with missing values.
Variable TEMPERATURE_DIFF_REL_4 has 46 records (11.95%) with missing values.
Variable OXYGEN_SATURATION_DIFF_REL_4 has 46 records (11.95%) with missing values.
Variable DISEASE GROUPING 1_5 has 1 records (0.26%) with missing values.
Variable DISEASE GROUPING 2_5 has 1 records (0.26%) with missing values.
Variable DISEASE GROUPING 3_5 has 1 records (0.26%) with missing values.
Variable DISEASE GROUPING 4_5 has 1 records (0.26%) with missing values.
Variable DISEASE GROUPING 5_5 has 1 records (0.26%) with missing values.
Variable DISEASE GROUPING 6_5 has 1 records (0.26%) with missing values.
Variable IMMUNOCOMPROMISED_5 has 1 records (0.26%) with missing values.
Variable OTHER_5 has 1 records (0.26%) with missing values.
Variable ALBUMIN_MEDIAN_5 has 11 records (2.86%) with missing values.
Variable ALBUMIN_MEAN_5 has 11 records (2.86%) with missing values.
Variable ALBUMIN_MIN_5 has 11 records (2.86%) with missing values.
Variable ALBUMIN_MAX_5 has 11 records (2.86%) with missing values.
Variable ALBUMIN_DIFF_5 has 11 records (2.86%) with missing values.
Variable BE_ARTERIAL_MEDIAN_5 has 11 records (2.86%) with missing values.
Variable BE_ARTERIAL_MEAN_5 has 11 records (2.86%) with missing values.
Variable BE_ARTERIAL_MIN_5 has 11 records (2.86%) with missing values.
Variable BE_ARTERIAL_MAX_5 has 11 records (2.86%) with missing values.
Variable BE_ARTERIAL_DIFF_5 has 11 records (2.86%) with missing values.
Variable BE_VENOUS_MEDIAN_5 has 11 records (2.86%) with missing values.
Variable BE_VENOUS_MEAN_5 has 11 records (2.86%) with missing values.
Variable BE_VENOUS_MIN_5 has 11 records (2.86%) with missing values.
Variable BE_VENOUS_MAX_5 has 11 records (2.86%) with missing values.
Variable BE_VENOUS_DIFF_5 has 11 records (2.86%) with missing values.
Variable BIC_ARTERIAL_MEDIAN_5 has 11 records (2.86%) with missing values.
Variable BIC_ARTERIAL_MEAN_5 has 11 records (2.86%) with missing values.
Variable BIC_ARTERIAL_MIN_5 has 11 records (2.86%) with missing values.
Variable BIC_ARTERIAL_MAX_5 has 11 records (2.86%) with missing values.
Variable BIC_ARTERIAL_DIFF_5 has 11 records (2.86%) with missing values.
Variable BIC_VENOUS_MEDIAN_5 has 11 records (2.86%) with missing values.
Variable BIC_VENOUS_MEAN_5 has 11 records (2.86%) with missing values.
Variable BIC_VENOUS_MIN_5 has 11 records (2.86%) with missing values.
Variable BIC_VENOUS_MAX_5 has 11 records (2.86%) with missing values.
Variable BIC_VENOUS_DIFF_5 has 11 records (2.86%) with missing values.
Variable BILLIRUBIN_MEDIAN_5 has 11 records (2.86%) with missing values.
Variable BILLIRUBIN_MEAN_5 has 11 records (2.86%) with missing values.
Variable BILLIRUBIN_MIN_5 has 11 records (2.86%) with missing values.
Variable BILLIRUBIN_MAX_5 has 11 records (2.86%) with missing values.
Variable BILLIRUBIN_DIFF_5 has 11 records (2.86%) with missing values.
Variable BLAST_MEDIAN_5 has 11 records (2.86%) with missing values.
Variable BLAST_MEAN_5 has 11 records (2.86%) with missing values.
Variable BLAST_MIN_5 has 11 records (2.86%) with missing values.
Variable BLAST_MAX_5 has 11 records (2.86%) with missing values.
Variable BLAST_DIFF_5 has 11 records (2.86%) with missing values.
Variable CALCIUM_MEDIAN_5 has 11 records (2.86%) with missing values.
Variable CALCIUM_MEAN_5 has 11 records (2.86%) with missing values.
Variable CALCIUM_MIN_5 has 11 records (2.86%) with missing values.
Variable CALCIUM_MAX_5 has 11 records (2.86%) with missing values.
Variable CALCIUM_DIFF_5 has 11 records (2.86%) with missing values.
Variable CREATININ_MEDIAN_5 has 11 records (2.86%) with missing values.
Variable CREATININ_MEAN_5 has 11 records (2.86%) with missing values.
Variable CREATININ_MIN_5 has 11 records (2.86%) with missing values.
Variable CREATININ_MAX_5 has 11 records (2.86%) with missing values.
Variable CREATININ_DIFF_5 has 11 records (2.86%) with missing values.
Variable FFA_MEDIAN_5 has 11 records (2.86%) with missing values.
Variable FFA_MEAN_5 has 11 records (2.86%) with missing values.
Variable FFA_MIN_5 has 11 records (2.86%) with missing values.
Variable FFA_MAX_5 has 11 records (2.86%) with missing values.
Variable FFA_DIFF_5 has 11 records (2.86%) with missing values.
Variable GGT_MEDIAN_5 has 11 records (2.86%) with missing values.
Variable GGT_MEAN_5 has 11 records (2.86%) with missing values.
Variable GGT_MIN_5 has 11 records (2.86%) with missing values.
Variable GGT_MAX_5 has 11 records (2.86%) with missing values.
Variable GGT_DIFF_5 has 11 records (2.86%) with missing values.
Variable GLUCOSE_MEDIAN_5 has 11 records (2.86%) with missing values.
Variable GLUCOSE_MEAN_5 has 11 records (2.86%) with missing values.
Variable GLUCOSE_MIN_5 has 11 records (2.86%) with missing values.
Variable GLUCOSE_MAX_5 has 11 records (2.86%) with missing values.
Variable GLUCOSE_DIFF_5 has 11 records (2.86%) with missing values.
Variable HEMATOCRITE_MEDIAN_5 has 11 records (2.86%) with missing values.
Variable HEMATOCRITE_MEAN_5 has 11 records (2.86%) with missing values.
Variable HEMATOCRITE_MIN_5 has 11 records (2.86%) with missing values.
Variable HEMATOCRITE_MAX_5 has 11 records (2.86%) with missing values.
Variable HEMATOCRITE_DIFF_5 has 11 records (2.86%) with missing values.
Variable HEMOGLOBIN_MEDIAN_5 has 11 records (2.86%) with missing values.
Variable HEMOGLOBIN_MEAN_5 has 11 records (2.86%) with missing values.
Variable HEMOGLOBIN_MIN_5 has 11 records (2.86%) with missing values.
Variable HEMOGLOBIN_MAX_5 has 11 records (2.86%) with missing values.
Variable HEMOGLOBIN_DIFF_5 has 11 records (2.86%) with missing values.
Variable INR_MEDIAN_5 has 11 records (2.86%) with missing values.
Variable INR_MEAN_5 has 11 records (2.86%) with missing values.
Variable INR_MIN_5 has 11 records (2.86%) with missing values.
Variable INR_MAX_5 has 11 records (2.86%) with missing values.
Variable INR_DIFF_5 has 11 records (2.86%) with missing values.
Variable LACTATE_MEDIAN_5 has 11 records (2.86%) with missing values.
Variable LACTATE_MEAN_5 has 11 records (2.86%) with missing values.
Variable LACTATE_MIN_5 has 11 records (2.86%) with missing values.
Variable LACTATE_MAX_5 has 11 records (2.86%) with missing values.
Variable LACTATE_DIFF_5 has 11 records (2.86%) with missing values.
Variable LEUKOCYTES_MEDIAN_5 has 11 records (2.86%) with missing values.
Variable LEUKOCYTES_MEAN_5 has 11 records (2.86%) with missing values.
Variable LEUKOCYTES_MIN_5 has 11 records (2.86%) with missing values.
Variable LEUKOCYTES_MAX_5 has 11 records (2.86%) with missing values.
Variable LEUKOCYTES_DIFF_5 has 11 records (2.86%) with missing values.
Variable LINFOCITOS_MEDIAN_5 has 11 records (2.86%) with missing values.
Variable LINFOCITOS_MEAN_5 has 11 records (2.86%) with missing values.
Variable LINFOCITOS_MIN_5 has 11 records (2.86%) with missing values.
Variable LINFOCITOS_MAX_5 has 11 records (2.86%) with missing values.
Variable LINFOCITOS_DIFF_5 has 11 records (2.86%) with missing values.
Variable NEUTROPHILES_MEDIAN_5 has 11 records (2.86%) with missing values.
Variable NEUTROPHILES_MEAN_5 has 11 records (2.86%) with missing values.
Variable NEUTROPHILES_MIN_5 has 11 records (2.86%) with missing values.
Variable NEUTROPHILES_MAX_5 has 11 records (2.86%) with missing values.
Variable NEUTROPHILES_DIFF_5 has 11 records (2.86%) with missing values.
Variable P02_ARTERIAL_MEDIAN_5 has 11 records (2.86%) with missing values.
Variable P02_ARTERIAL_MEAN_5 has 11 records (2.86%) with missing values.
Variable P02_ARTERIAL_MIN_5 has 11 records (2.86%) with missing values.
Variable P02_ARTERIAL_MAX_5 has 11 records (2.86%) with missing values.
Variable P02_ARTERIAL_DIFF_5 has 11 records (2.86%) with missing values.
Variable P02_VENOUS_MEDIAN_5 has 11 records (2.86%) with missing values.
Variable P02_VENOUS_MEAN_5 has 11 records (2.86%) with missing values.
Variable P02_VENOUS_MIN_5 has 11 records (2.86%) with missing values.
Variable P02_VENOUS_MAX_5 has 11 records (2.86%) with missing values.
Variable P02_VENOUS_DIFF_5 has 11 records (2.86%) with missing values.
Variable PC02_ARTERIAL_MEDIAN_5 has 11 records (2.86%) with missing values.
Variable PC02_ARTERIAL_MEAN_5 has 11 records (2.86%) with missing values.
Variable PC02_ARTERIAL_MIN_5 has 11 records (2.86%) with missing values.
Variable PC02_ARTERIAL_MAX_5 has 11 records (2.86%) with missing values.
Variable PC02_ARTERIAL_DIFF_5 has 11 records (2.86%) with missing values.
Variable PC02_VENOUS_MEDIAN_5 has 11 records (2.86%) with missing values.
Variable PC02_VENOUS_MEAN_5 has 11 records (2.86%) with missing values.
Variable PC02_VENOUS_MIN_5 has 11 records (2.86%) with missing values.
Variable PC02_VENOUS_MAX_5 has 11 records (2.86%) with missing values.
Variable PC02_VENOUS_DIFF_5 has 11 records (2.86%) with missing values.
Variable PCR_MEDIAN_5 has 11 records (2.86%) with missing values.
Variable PCR_MEAN_5 has 11 records (2.86%) with missing values.
Variable PCR_MIN_5 has 11 records (2.86%) with missing values.
Variable PCR_MAX_5 has 11 records (2.86%) with missing values.
Variable PCR_DIFF_5 has 11 records (2.86%) with missing values.
Variable PH_ARTERIAL_MEDIAN_5 has 11 records (2.86%) with missing values.
Variable PH_ARTERIAL_MEAN_5 has 11 records (2.86%) with missing values.
Variable PH_ARTERIAL_MIN_5 has 11 records (2.86%) with missing values.
Variable PH_ARTERIAL_MAX_5 has 11 records (2.86%) with missing values.
Variable PH_ARTERIAL_DIFF_5 has 11 records (2.86%) with missing values.
Variable PH_VENOUS_MEDIAN_5 has 11 records (2.86%) with missing values.
Variable PH_VENOUS_MEAN_5 has 11 records (2.86%) with missing values.
Variable PH_VENOUS_MIN_5 has 11 records (2.86%) with missing values.
Variable PH_VENOUS_MAX_5 has 11 records (2.86%) with missing values.
Variable PH_VENOUS_DIFF_5 has 11 records (2.86%) with missing values.
Variable PLATELETS_MEDIAN_5 has 11 records (2.86%) with missing values.
Variable PLATELETS_MEAN_5 has 11 records (2.86%) with missing values.
Variable PLATELETS_MIN_5 has 11 records (2.86%) with missing values.
Variable PLATELETS_MAX_5 has 11 records (2.86%) with missing values.
Variable PLATELETS_DIFF_5 has 11 records (2.86%) with missing values.
Variable POTASSIUM_MEDIAN_5 has 11 records (2.86%) with missing values.
Variable POTASSIUM_MEAN_5 has 11 records (2.86%) with missing values.
Variable POTASSIUM_MIN_5 has 11 records (2.86%) with missing values.
Variable POTASSIUM_MAX_5 has 11 records (2.86%) with missing values.
Variable POTASSIUM_DIFF_5 has 11 records (2.86%) with missing values.
Variable SAT02_ARTERIAL_MEDIAN_5 has 11 records (2.86%) with missing values.
Variable SAT02_ARTERIAL_MEAN_5 has 11 records (2.86%) with missing values.
Variable SAT02_ARTERIAL_MIN_5 has 11 records (2.86%) with missing values.
Variable SAT02_ARTERIAL_MAX_5 has 11 records (2.86%) with missing values.
Variable SAT02_ARTERIAL_DIFF_5 has 11 records (2.86%) with missing values.
Variable SAT02_VENOUS_MEDIAN_5 has 11 records (2.86%) with missing values.
Variable SAT02_VENOUS_MEAN_5 has 11 records (2.86%) with missing values.
Variable SAT02_VENOUS_MIN_5 has 11 records (2.86%) with missing values.
Variable SAT02_VENOUS_MAX_5 has 11 records (2.86%) with missing values.
Variable SAT02_VENOUS_DIFF_5 has 11 records (2.86%) with missing values.
Variable SODIUM_MEDIAN_5 has 11 records (2.86%) with missing values.
Variable SODIUM_MEAN_5 has 11 records (2.86%) with missing values.
Variable SODIUM_MIN_5 has 11 records (2.86%) with missing values.
Variable SODIUM_MAX_5 has 11 records (2.86%) with missing values.
Variable SODIUM_DIFF_5 has 11 records (2.86%) with missing values.
Variable TGO_MEDIAN_5 has 11 records (2.86%) with missing values.
Variable TGO_MEAN_5 has 11 records (2.86%) with missing values.
Variable TGO_MIN_5 has 11 records (2.86%) with missing values.
Variable TGO_MAX_5 has 11 records (2.86%) with missing values.
Variable TGO_DIFF_5 has 11 records (2.86%) with missing values.
Variable TGP_MEDIAN_5 has 11 records (2.86%) with missing values.
Variable TGP_MEAN_5 has 11 records (2.86%) with missing values.
Variable TGP_MIN_5 has 11 records (2.86%) with missing values.
Variable TGP_MAX_5 has 11 records (2.86%) with missing values.
Variable TGP_DIFF_5 has 11 records (2.86%) with missing values.
Variable TTPA_MEDIAN_5 has 11 records (2.86%) with missing values.
Variable TTPA_MEAN_5 has 11 records (2.86%) with missing values.
Variable TTPA_MIN_5 has 11 records (2.86%) with missing values.
Variable TTPA_MAX_5 has 11 records (2.86%) with missing values.
Variable TTPA_DIFF_5 has 11 records (2.86%) with missing values.
Variable UREA_MEDIAN_5 has 11 records (2.86%) with missing values.
Variable UREA_MEAN_5 has 11 records (2.86%) with missing values.
Variable UREA_MIN_5 has 11 records (2.86%) with missing values.
Variable UREA_MAX_5 has 11 records (2.86%) with missing values.
Variable UREA_DIFF_5 has 11 records (2.86%) with missing values.
Variable DIMER_MEDIAN_5 has 11 records (2.86%) with missing values.
Variable DIMER_MEAN_5 has 11 records (2.86%) with missing values.
Variable DIMER_MIN_5 has 11 records (2.86%) with missing values.
Variable DIMER_MAX_5 has 11 records (2.86%) with missing values.
Variable DIMER_DIFF_5 has 11 records (2.86%) with missing values.
Variable BLOODPRESSURE_DIASTOLIC_MEAN_5 has 1 records (0.26%) with missing values.
Variable BLOODPRESSURE_SISTOLIC_MEAN_5 has 1 records (0.26%) with missing values.
Variable HEART_RATE_MEAN_5 has 1 records (0.26%) with missing values.
Variable RESPIRATORY_RATE_MEAN_5 has 1 records (0.26%) with missing values.
Variable TEMPERATURE_MEAN_5 has 1 records (0.26%) with missing values.
Variable OXYGEN_SATURATION_MEAN_5 has 1 records (0.26%) with missing values.
Variable BLOODPRESSURE_DIASTOLIC_MEDIAN_5 has 1 records (0.26%) with missing values.
Variable BLOODPRESSURE_SISTOLIC_MEDIAN_5 has 1 records (0.26%) with missing values.
Variable HEART_RATE_MEDIAN_5 has 1 records (0.26%) with missing values.
Variable RESPIRATORY_RATE_MEDIAN_5 has 1 records (0.26%) with missing values.
Variable TEMPERATURE_MEDIAN_5 has 1 records (0.26%) with missing values.
Variable OXYGEN_SATURATION_MEDIAN_5 has 1 records (0.26%) with missing values.
Variable BLOODPRESSURE_DIASTOLIC_MIN_5 has 1 records (0.26%) with missing values.
Variable BLOODPRESSURE_SISTOLIC_MIN_5 has 1 records (0.26%) with missing values.
Variable HEART_RATE_MIN_5 has 1 records (0.26%) with missing values.
Variable RESPIRATORY_RATE_MIN_5 has 1 records (0.26%) with missing values.
Variable TEMPERATURE_MIN_5 has 1 records (0.26%) with missing values.
Variable OXYGEN_SATURATION_MIN_5 has 1 records (0.26%) with missing values.
Variable BLOODPRESSURE_DIASTOLIC_MAX_5 has 1 records (0.26%) with missing values.
Variable BLOODPRESSURE_SISTOLIC_MAX_5 has 1 records (0.26%) with missing values.
Variable HEART_RATE_MAX_5 has 1 records (0.26%) with missing values.
Variable RESPIRATORY_RATE_MAX_5 has 1 records (0.26%) with missing values.
Variable TEMPERATURE_MAX_5 has 1 records (0.26%) with missing values.
Variable OXYGEN_SATURATION_MAX_5 has 1 records (0.26%) with missing values.
Variable BLOODPRESSURE_DIASTOLIC_DIFF_5 has 1 records (0.26%) with missing values.
Variable BLOODPRESSURE_SISTOLIC_DIFF_5 has 1 records (0.26%) with missing values.
Variable HEART_RATE_DIFF_5 has 1 records (0.26%) with missing values.
Variable RESPIRATORY_RATE_DIFF_5 has 1 records (0.26%) with missing values.
Variable TEMPERATURE_DIFF_5 has 1 records (0.26%) with missing values.
Variable OXYGEN_SATURATION_DIFF_5 has 1 records (0.26%) with missing values.
Variable BLOODPRESSURE_DIASTOLIC_DIFF_REL_5 has 1 records (0.26%) with missing values.
Variable BLOODPRESSURE_SISTOLIC_DIFF_REL_5 has 1 records (0.26%) with missing values.
Variable HEART_RATE_DIFF_REL_5 has 1 records (0.26%) with missing values.
Variable RESPIRATORY_RATE_DIFF_REL_5 has 1 records (0.26%) with missing values.
Variable TEMPERATURE_DIFF_REL_5 has 1 records (0.26%) with missing values.
Variable OXYGEN_SATURATION_DIFF_REL_5 has 1 records (0.26%) with missing values.
In total, there are 1121 variables with missing values
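The per-variable report above can be generated with a short loop over the null counts. A minimal sketch (using a toy stand-in for the notebook's `data`; column names are illustrative):

```python
import pandas as pd
import numpy as np

# Toy stand-in for the notebook's `data`; a loop of roughly this shape
# produces the per-variable missing-value report printed above.
data = pd.DataFrame({'A': [1.0, np.nan, np.nan], 'B': [1.0, 2.0, 3.0]})

missing = data.isnull().sum()
missing = missing[missing > 0]  # keep only variables with at least one null
for name, count in missing.items():
    pct = 100 * count / len(data)
    print(f'Variable {name} has {count} records ({pct:.2f}%) with missing values.')
print(f'In total, there are {len(missing)} variables with missing values')
```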
In [25]:
#Identify features with missing values as a percentage of the total number of records
null_values = data.isnull().sum()
null_values = 100 * null_values[null_values > 0] / len(data)
# Sort the values in descending order
null_values.sort_values(ascending = False)
Out[25]:
P02_VENOUS_MAX_3                89.090909
FFA_MEDIAN_3                    89.090909
HEMOGLOBIN_MEDIAN_3             89.090909
HEMATOCRITE_DIFF_3              89.090909
HEMATOCRITE_MAX_3               89.090909
                                  ...    
DISEASE GROUPING 1_4             0.259740
DISEASE GROUPING 2_1             0.259740
OTHER_3                          0.259740
IMMUNOCOMPROMISED_3              0.259740
OXYGEN_SATURATION_DIFF_REL_5     0.259740
Length: 1121, dtype: float64

As we can see, some features have virtually no missing values, while others are almost entirely null. However, since these attributes have a time component, we should not be hasty to discard this data.
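One way the time component can be exploited, rather than dropping the data outright, is to carry the last observed window forward. A minimal sketch on a toy frame (the feature/window names mirror the real `_1` to `_5` suffix layout but are illustrative):

```python
import pandas as pd
import numpy as np

# Toy wide-format frame: one column per (feature, window), mirroring the
# GLUCOSE_MEAN_1 ... GLUCOSE_MEAN_5 layout. Names here are illustrative.
df = pd.DataFrame({
    'GLUCOSE_MEAN_1': [0.2, np.nan],
    'GLUCOSE_MEAN_2': [np.nan, 0.5],
    'GLUCOSE_MEAN_3': [np.nan, np.nan],
})

# Forward-fill across the time windows: a null in window n takes the
# most recent observed value from an earlier window on the same row.
ordered = ['GLUCOSE_MEAN_' + str(n) for n in range(1, 4)]
df[ordered] = df[ordered].ffill(axis=1)
print(df)
```

Note that a value missing in the first window stays missing, since there is no earlier measurement to carry forward.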

In [26]:
over_50pct_nulls = null_values[null_values > 50]
print(over_50pct_nulls.sort_index())
ALBUMIN_DIFF_1    55.324675
ALBUMIN_DIFF_2    54.025974
ALBUMIN_DIFF_3    89.090909
ALBUMIN_DIFF_4    85.454545
ALBUMIN_MAX_1     55.324675
                    ...    
UREA_MEDIAN_4     85.454545
UREA_MIN_1        55.324675
UREA_MIN_2        54.025974
UREA_MIN_3        89.090909
UREA_MIN_4        85.454545
Length: 792, dtype: float64

Observation:

  • We have 792 columns with over 50% missing data
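For a single column-wise cutoff, pandas also offers `dropna(thresh=...)` as a one-liner: a column is kept only if it has at least `thresh` non-null values. A minimal sketch on a toy stand-in for `data` (column names illustrative):

```python
import math
import pandas as pd
import numpy as np

# Toy stand-in for `data` (column names illustrative).
data = pd.DataFrame({
    'mostly_null': [np.nan, np.nan, np.nan, 1.0],  # 75% missing -> dropped
    'mostly_ok':   [1.0, 2.0, np.nan, 4.0],        # 25% missing -> kept
})

thresh = math.ceil(len(data) / 2)  # require at least 50% non-null values
trimmed = data.dropna(axis=1, thresh=thresh)
print(list(trimmed.columns))
```

The notebook instead drops whole feature groups (all five windows of a feature) in the cells below, which keeps the window structure consistent across the remaining features.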
In [27]:
#Identifying which feature groups are mostly composed of missing values
over_50pct_nulls = over_50pct_nulls.reset_index(level = 0).rename(columns = {'index': 'Feature', 0: 'Null_Pct'})
over_50pct_nulls['Feature_Group'] = [x[:-2] for x in over_50pct_nulls['Feature']]

null_aggregate = over_50pct_nulls.groupby(by = 'Feature_Group').agg({'Null_Pct': ['count', 'min', 'max']})
print(null_aggregate)
               Null_Pct                      
                  count        min        max
Feature_Group                                
ALBUMIN_DIFF          4  54.025974  89.090909
ALBUMIN_MAX           4  54.025974  89.090909
ALBUMIN_MEAN          4  54.025974  89.090909
ALBUMIN_MEDIAN        4  54.025974  89.090909
ALBUMIN_MIN           4  54.025974  89.090909
...                 ...        ...        ...
UREA_DIFF             4  54.025974  89.090909
UREA_MAX              4  54.025974  89.090909
UREA_MEAN             4  54.025974  89.090909
UREA_MEDIAN           4  54.025974  89.090909
UREA_MIN              4  54.025974  89.090909

[216 rows x 3 columns]
In [28]:
null_aggregate.columns = ['_'.join(col).strip() for col in null_aggregate.columns.values]
null_aggregate.query('Null_Pct_count == 4 & Null_Pct_min > 50')
Out[28]:
Null_Pct_count Null_Pct_min Null_Pct_max
Feature_Group
ALBUMIN_DIFF 4 54.025974 89.090909
ALBUMIN_MAX 4 54.025974 89.090909
ALBUMIN_MEAN 4 54.025974 89.090909
ALBUMIN_MEDIAN 4 54.025974 89.090909
ALBUMIN_MIN 4 54.025974 89.090909
... ... ... ...
UREA_DIFF 4 54.025974 89.090909
UREA_MAX 4 54.025974 89.090909
UREA_MEAN 4 54.025974 89.090909
UREA_MEDIAN 4 54.025974 89.090909
UREA_MIN 4 54.025974 89.090909

180 rows × 3 columns

These findings indicate that 180 feature groups contain more nulls than genuine values. As our initial strategy, the associated features will simply be removed.

In [29]:
#Remove columns with an overrepresentation of null values
over_50pct_nulls = null_aggregate.query('Null_Pct_count == 4 & Null_Pct_min > 50').index

for n in range(1,6):
    remove_cols = [x + '_' + str(n) for x in over_50pct_nulls]
    data = data.drop(columns = remove_cols)
    
data.head()
Out[29]:
PATIENT_VISIT_IDENTIFIER_1 AGE_ABOVE65_1 AGE_PERCENTIL_1 GENDER_1 DISEASE GROUPING 1_1 DISEASE GROUPING 2_1 DISEASE GROUPING 3_1 DISEASE GROUPING 4_1 DISEASE GROUPING 5_1 DISEASE GROUPING 6_1 HTN_1 IMMUNOCOMPROMISED_1 OTHER_1 BLOODPRESSURE_DIASTOLIC_MEAN_1 BLOODPRESSURE_SISTOLIC_MEAN_1 HEART_RATE_MEAN_1 RESPIRATORY_RATE_MEAN_1 TEMPERATURE_MEAN_1 OXYGEN_SATURATION_MEAN_1 BLOODPRESSURE_DIASTOLIC_MEDIAN_1 BLOODPRESSURE_SISTOLIC_MEDIAN_1 HEART_RATE_MEDIAN_1 RESPIRATORY_RATE_MEDIAN_1 TEMPERATURE_MEDIAN_1 OXYGEN_SATURATION_MEDIAN_1 BLOODPRESSURE_DIASTOLIC_MIN_1 BLOODPRESSURE_SISTOLIC_MIN_1 HEART_RATE_MIN_1 RESPIRATORY_RATE_MIN_1 TEMPERATURE_MIN_1 OXYGEN_SATURATION_MIN_1 BLOODPRESSURE_DIASTOLIC_MAX_1 BLOODPRESSURE_SISTOLIC_MAX_1 HEART_RATE_MAX_1 RESPIRATORY_RATE_MAX_1 TEMPERATURE_MAX_1 OXYGEN_SATURATION_MAX_1 BLOODPRESSURE_DIASTOLIC_DIFF_1 BLOODPRESSURE_SISTOLIC_DIFF_1 HEART_RATE_DIFF_1 RESPIRATORY_RATE_DIFF_1 TEMPERATURE_DIFF_1 OXYGEN_SATURATION_DIFF_1 BLOODPRESSURE_DIASTOLIC_DIFF_REL_1 BLOODPRESSURE_SISTOLIC_DIFF_REL_1 HEART_RATE_DIFF_REL_1 RESPIRATORY_RATE_DIFF_REL_1 TEMPERATURE_DIFF_REL_1 OXYGEN_SATURATION_DIFF_REL_1 ICU_1 ... 
HEART_RATE_DIFF_REL_4 RESPIRATORY_RATE_DIFF_REL_4 TEMPERATURE_DIFF_REL_4 OXYGEN_SATURATION_DIFF_REL_4 ICU_4 DISEASE GROUPING 1_5 DISEASE GROUPING 2_5 DISEASE GROUPING 3_5 DISEASE GROUPING 4_5 DISEASE GROUPING 5_5 DISEASE GROUPING 6_5 IMMUNOCOMPROMISED_5 OTHER_5 BLOODPRESSURE_DIASTOLIC_MEAN_5 BLOODPRESSURE_SISTOLIC_MEAN_5 HEART_RATE_MEAN_5 RESPIRATORY_RATE_MEAN_5 TEMPERATURE_MEAN_5 OXYGEN_SATURATION_MEAN_5 BLOODPRESSURE_DIASTOLIC_MEDIAN_5 BLOODPRESSURE_SISTOLIC_MEDIAN_5 HEART_RATE_MEDIAN_5 RESPIRATORY_RATE_MEDIAN_5 TEMPERATURE_MEDIAN_5 OXYGEN_SATURATION_MEDIAN_5 BLOODPRESSURE_DIASTOLIC_MIN_5 BLOODPRESSURE_SISTOLIC_MIN_5 HEART_RATE_MIN_5 RESPIRATORY_RATE_MIN_5 TEMPERATURE_MIN_5 OXYGEN_SATURATION_MIN_5 BLOODPRESSURE_DIASTOLIC_MAX_5 BLOODPRESSURE_SISTOLIC_MAX_5 HEART_RATE_MAX_5 RESPIRATORY_RATE_MAX_5 TEMPERATURE_MAX_5 OXYGEN_SATURATION_MAX_5 BLOODPRESSURE_DIASTOLIC_DIFF_5 BLOODPRESSURE_SISTOLIC_DIFF_5 HEART_RATE_DIFF_5 RESPIRATORY_RATE_DIFF_5 TEMPERATURE_DIFF_5 OXYGEN_SATURATION_DIFF_5 BLOODPRESSURE_DIASTOLIC_DIFF_REL_5 BLOODPRESSURE_SISTOLIC_DIFF_REL_5 HEART_RATE_DIFF_REL_5 RESPIRATORY_RATE_DIFF_REL_5 TEMPERATURE_DIFF_REL_5 OXYGEN_SATURATION_DIFF_REL_5 ICU_5
0 0 1 60th 0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.086420 -0.230769 -0.283019 -0.593220 -0.285714 0.736842 0.086420 -0.230769 -0.283019 -0.586207 -0.285714 0.736842 0.237113 0.00 -0.162393 -0.5 0.208791 0.89899 -0.247863 -0.459459 -0.432836 -0.636364 -0.420290 0.736842 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 0 ... NaN NaN -1.000000 -1.000000 0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 1.0 -0.243021 -0.338537 -0.213031 -0.317859 0.033779 0.665932 -0.283951 -0.376923 -0.188679 -0.379310 0.035714 0.631579 -0.340206 -0.4875 -0.572650 -0.857143 0.098901 0.797980 -0.076923 0.286486 0.298507 0.272727 0.362319 0.947368 -0.339130 0.325153 0.114504 0.176471 -0.238095 -0.818182 -0.389967 0.407558 -0.230462 0.096774 -0.242282 -0.814433 1
1 1 1 90th 1 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 -0.283951 -0.046154 0.188679 0.830508 -0.107143 1.000000 -0.283951 -0.046154 0.188679 0.862069 -0.107143 1.000000 -0.072165 0.15 0.264957 1.0 0.318681 1.00000 -0.504274 -0.329730 -0.059701 0.636364 -0.275362 1.000000 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 1 ... -0.940967 -0.817204 -0.882574 -1.000000 1 0.0 0.0 0.0 0.0 1.0 0.0 1.0 1.0 -0.178122 0.212601 -0.141163 -0.380216 0.010915 0.841977 -0.185185 0.184615 -0.169811 -0.379310 0.000000 0.842105 -0.587629 -0.3250 -0.572650 -1.000000 0.010989 0.797980 0.555556 0.556757 0.298507 0.757576 0.710145 1.000000 0.513043 0.472393 0.114504 0.764706 0.142857 -0.797980 0.315690 0.200359 -0.239515 0.645161 0.139709 -0.802317 1
2 2 0 10th 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 ... -0.721834 -0.926882 -1.000000 -0.801293 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 -0.181070 -0.551603 -0.280660 -0.543785 0.057292 0.797149 -0.160494 -0.538462 -0.273585 -0.517241 0.107143 0.789474 -0.298969 -0.4500 -0.487179 -0.642857 0.142857 0.878788 -0.247863 -0.351351 -0.149254 -0.454545 0.101449 0.947368 -0.547826 -0.435583 -0.419847 -0.705882 -0.500000 -0.898990 -0.612422 -0.343258 -0.576744 -0.695341 -0.505464 -0.900129 1
3 3 0 40th 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 ... -1.000000 -1.000000 -1.000000 -1.000000 0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 -0.002798 -0.546256 -0.270189 -0.535593 0.033571 0.694035 0.086420 -0.538462 -0.301887 -0.517241 -0.035714 0.736842 -0.381443 -0.6250 -0.521368 -0.857143 0.120879 0.171717 0.145299 -0.286486 0.477612 -0.272727 0.623188 1.000000 -0.078261 -0.190184 0.251908 -0.352941 -0.047619 -0.171717 -0.308696 -0.057718 -0.069094 -0.329749 -0.047619 -0.172436 0
4 4 0 10th 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 ... -0.926209 -1.000000 -0.698797 -0.960463 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.290762 -0.074271 0.051399 -0.499708 0.040640 0.820327 0.333333 -0.076923 0.056604 -0.517241 0.071429 0.789474 0.030928 -0.1250 -0.230769 -0.500000 0.208791 0.898990 0.094017 -0.178378 0.104478 -0.454545 0.014493 0.894737 -0.478261 -0.558282 -0.389313 -0.823529 -0.642857 -0.939394 -0.652174 -0.596165 -0.634847 -0.817204 -0.645793 -0.940077 0

5 rows × 230 columns

Our dataset has been reduced from the prior 1131 columns to 230. We are certainly discarding a lot of information in this process, but that is fine: we can always come back and take another look at these features later.

Let's examine how to handle the missing values in the remaining features.

In [30]:
null_values = data.isnull().sum()
null_values = 100 * null_values[null_values > 0] / len(data)

null_values.sort_index()
Out[30]:
BLOODPRESSURE_DIASTOLIC_DIFF_1    64.415584
BLOODPRESSURE_DIASTOLIC_DIFF_2    54.025974
BLOODPRESSURE_DIASTOLIC_DIFF_3    43.636364
BLOODPRESSURE_DIASTOLIC_DIFF_4    15.584416
BLOODPRESSURE_DIASTOLIC_DIFF_5     0.259740
                                    ...    
TEMPERATURE_MIN_1                 68.311688
TEMPERATURE_MIN_2                 55.844156
TEMPERATURE_MIN_3                 43.896104
TEMPERATURE_MIN_4                 11.948052
TEMPERATURE_MIN_5                  0.259740
Length: 221, dtype: float64

The computed missing-value percentages reveal an interesting pattern: values are frequently missing early in the hospital admission, and the missingness steadily decreases until the last measurement window, where it is almost zero. This is presumably because not all measurements are taken immediately after admission.

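The trend can be checked directly by averaging the missing-value percentages per window suffix. A minimal, self-contained sketch (the values below are an illustrative subset taken from the output above, not the full `null_values` Series):

```python
import pandas as pd

# Illustrative subset of the missing-value percentages printed above.
null_pct = pd.Series({
    "BLOODPRESSURE_DIASTOLIC_DIFF_1": 64.42, "BLOODPRESSURE_DIASTOLIC_DIFF_2": 54.03,
    "BLOODPRESSURE_DIASTOLIC_DIFF_3": 43.64, "BLOODPRESSURE_DIASTOLIC_DIFF_4": 15.58,
    "BLOODPRESSURE_DIASTOLIC_DIFF_5": 0.26,
    "TEMPERATURE_MIN_1": 68.31, "TEMPERATURE_MIN_2": 55.84,
    "TEMPERATURE_MIN_3": 43.90, "TEMPERATURE_MIN_4": 11.95,
    "TEMPERATURE_MIN_5": 0.26,
})

# Group by the trailing window number and average: missingness falls
# steadily from window 1 to window 5.
by_window = null_pct.groupby(null_pct.index.str[-1]).mean()
print(by_window)
```

Running the same grouping on the real `null_values` Series would confirm the pattern across all 221 affected columns.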
This brings us to the first method we will use to fill in the missing values. The assumption is that measurements begin to be recorded once the patient first shows a change in clinical status, so the first non-null value for a given patient can replace the preceding null values (i.e., backward filling).

In [31]:
#Split columns into time variant or patient constant
col_groups = np.unique([x[:-2] for x in data.columns.values], return_counts = True)
time_cols = [col_groups[0][x] for x in range(len(col_groups[0])) if col_groups[1][x] > 1]
constant_cols = [x + '_1' for x in col_groups[0] if x not in time_cols]

Observation:

This code splits the columns in the data into two groups: time-variant columns and patient constant columns. Time-variant columns are columns that contain multiple values for the same patient, such as multiple lab test results for a single patient over time. Patient constant columns are columns that contain the same value for a single patient across all time points, such as a patient's age or gender.

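A quick sanity check of this split on a hypothetical handful of column names (the suffix-stripping assumes every column name ends in `_1` through `_5`):

```python
import numpy as np

# Hypothetical column names: one patient-constant, one time-variant.
cols = ["AGE_ABOVE65_1", "TEMPERATURE_MEAN_1", "TEMPERATURE_MEAN_2"]

# Strip the '_n' window suffix and count occurrences of each base name.
names, counts = np.unique([c[:-2] for c in cols], return_counts=True)

# Base names appearing more than once vary over time; the rest are constant.
time_cols = [n for n, c in zip(names, counts) if c > 1]
constant_cols = [n + "_1" for n, c in zip(names, counts) if c == 1]
print(time_cols, constant_cols)  # ['TEMPERATURE_MEAN'] ['AGE_ABOVE65_1']
```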
In [32]:
#Define function to fill the missing values on the time variant features
def fill_missing_values(data, time_group, const_group):
    
    group_df = [data[const_group]]
    for group in time_group:
        col_names = [group + '_' + str(x) for x in range(1, 6)]
        group_df.append(data[col_names].fillna(method = 'backfill', axis = 1).reset_index(drop = True))
    
    return pd.concat(group_df, axis = 1)
In [33]:
#Fill missing values
data = fill_missing_values(data, time_cols, constant_cols)

null_values = data.isnull().sum()
null_values[null_values > 0].sort_values(ascending = False)
Out[33]:
HTN_1                         1
OXYGEN_SATURATION_MEDIAN_2    1
OXYGEN_SATURATION_MAX_1       1
OXYGEN_SATURATION_MAX_2       1
OXYGEN_SATURATION_MAX_3       1
                             ..
DISEASE GROUPING 4_2          1
DISEASE GROUPING 4_3          1
DISEASE GROUPING 4_4          1
DISEASE GROUPING 4_5          1
TEMPERATURE_MIN_5             1
Length: 221, dtype: int64

We used backward filling for the missing values.

However, even after applying this method, most features still contain a single null value. Most likely there is one record with very little data. Let's check.

In [34]:
#Check which rows still have missing values
null_values = data.isnull().sum(axis = 1)
null_rows = null_values[null_values > 0].index.values

print(null_rows)
[199]

PATIENT_VISIT_IDENTIFIER 199 has no records at all, so we will drop this row.

In [35]:
#Removing the PATIENT VISIT IDENTIFIER 199 missing values
data = data.drop(index = null_rows)
In [36]:
msno.matrix(data)
Out[36]:
<AxesSubplot:>

Observation:

  • No missing data

Remove Data Where ICU = 1¶

Previewing the dataset after this preprocessing, to see the current shape

In [37]:
data.shape
Out[37]:
(384, 230)
In [38]:
data.head()
Out[38]:
AGE_ABOVE65_1 AGE_PERCENTIL_1 GENDER_1 HTN_1 PATIENT_VISIT_IDENTIFIER_1 BLOODPRESSURE_DIASTOLIC_DIFF_1 BLOODPRESSURE_DIASTOLIC_DIFF_2 BLOODPRESSURE_DIASTOLIC_DIFF_3 BLOODPRESSURE_DIASTOLIC_DIFF_4 BLOODPRESSURE_DIASTOLIC_DIFF_5 BLOODPRESSURE_DIASTOLIC_DIFF_REL_1 BLOODPRESSURE_DIASTOLIC_DIFF_REL_2 BLOODPRESSURE_DIASTOLIC_DIFF_REL_3 BLOODPRESSURE_DIASTOLIC_DIFF_REL_4 BLOODPRESSURE_DIASTOLIC_DIFF_REL_5 BLOODPRESSURE_DIASTOLIC_MAX_1 BLOODPRESSURE_DIASTOLIC_MAX_2 BLOODPRESSURE_DIASTOLIC_MAX_3 BLOODPRESSURE_DIASTOLIC_MAX_4 BLOODPRESSURE_DIASTOLIC_MAX_5 BLOODPRESSURE_DIASTOLIC_MEAN_1 BLOODPRESSURE_DIASTOLIC_MEAN_2 BLOODPRESSURE_DIASTOLIC_MEAN_3 BLOODPRESSURE_DIASTOLIC_MEAN_4 BLOODPRESSURE_DIASTOLIC_MEAN_5 BLOODPRESSURE_DIASTOLIC_MEDIAN_1 BLOODPRESSURE_DIASTOLIC_MEDIAN_2 BLOODPRESSURE_DIASTOLIC_MEDIAN_3 BLOODPRESSURE_DIASTOLIC_MEDIAN_4 BLOODPRESSURE_DIASTOLIC_MEDIAN_5 BLOODPRESSURE_DIASTOLIC_MIN_1 BLOODPRESSURE_DIASTOLIC_MIN_2 BLOODPRESSURE_DIASTOLIC_MIN_3 BLOODPRESSURE_DIASTOLIC_MIN_4 BLOODPRESSURE_DIASTOLIC_MIN_5 BLOODPRESSURE_SISTOLIC_DIFF_1 BLOODPRESSURE_SISTOLIC_DIFF_2 BLOODPRESSURE_SISTOLIC_DIFF_3 BLOODPRESSURE_SISTOLIC_DIFF_4 BLOODPRESSURE_SISTOLIC_DIFF_5 BLOODPRESSURE_SISTOLIC_DIFF_REL_1 BLOODPRESSURE_SISTOLIC_DIFF_REL_2 BLOODPRESSURE_SISTOLIC_DIFF_REL_3 BLOODPRESSURE_SISTOLIC_DIFF_REL_4 BLOODPRESSURE_SISTOLIC_DIFF_REL_5 BLOODPRESSURE_SISTOLIC_MAX_1 BLOODPRESSURE_SISTOLIC_MAX_2 BLOODPRESSURE_SISTOLIC_MAX_3 BLOODPRESSURE_SISTOLIC_MAX_4 BLOODPRESSURE_SISTOLIC_MAX_5 ... 
RESPIRATORY_RATE_MAX_1 RESPIRATORY_RATE_MAX_2 RESPIRATORY_RATE_MAX_3 RESPIRATORY_RATE_MAX_4 RESPIRATORY_RATE_MAX_5 RESPIRATORY_RATE_MEAN_1 RESPIRATORY_RATE_MEAN_2 RESPIRATORY_RATE_MEAN_3 RESPIRATORY_RATE_MEAN_4 RESPIRATORY_RATE_MEAN_5 RESPIRATORY_RATE_MEDIAN_1 RESPIRATORY_RATE_MEDIAN_2 RESPIRATORY_RATE_MEDIAN_3 RESPIRATORY_RATE_MEDIAN_4 RESPIRATORY_RATE_MEDIAN_5 RESPIRATORY_RATE_MIN_1 RESPIRATORY_RATE_MIN_2 RESPIRATORY_RATE_MIN_3 RESPIRATORY_RATE_MIN_4 RESPIRATORY_RATE_MIN_5 TEMPERATURE_DIFF_1 TEMPERATURE_DIFF_2 TEMPERATURE_DIFF_3 TEMPERATURE_DIFF_4 TEMPERATURE_DIFF_5 TEMPERATURE_DIFF_REL_1 TEMPERATURE_DIFF_REL_2 TEMPERATURE_DIFF_REL_3 TEMPERATURE_DIFF_REL_4 TEMPERATURE_DIFF_REL_5 TEMPERATURE_MAX_1 TEMPERATURE_MAX_2 TEMPERATURE_MAX_3 TEMPERATURE_MAX_4 TEMPERATURE_MAX_5 TEMPERATURE_MEAN_1 TEMPERATURE_MEAN_2 TEMPERATURE_MEAN_3 TEMPERATURE_MEAN_4 TEMPERATURE_MEAN_5 TEMPERATURE_MEDIAN_1 TEMPERATURE_MEDIAN_2 TEMPERATURE_MEDIAN_3 TEMPERATURE_MEDIAN_4 TEMPERATURE_MEDIAN_5 TEMPERATURE_MIN_1 TEMPERATURE_MIN_2 TEMPERATURE_MIN_3 TEMPERATURE_MIN_4 TEMPERATURE_MIN_5
0 1 60th 0 0.0 0 -1.000000 -1.000000 -0.339130 -0.339130 -0.339130 -1.000000 -1.000000 -0.389967 -0.389967 -0.389967 -0.247863 -0.076923 -0.076923 -0.076923 -0.076923 0.086420 0.333333 -0.243021 -0.243021 -0.243021 0.086420 0.333333 -0.283951 -0.283951 -0.283951 0.237113 0.443299 -0.340206 -0.340206 -0.340206 -1.000000 -1.000000 0.325153 0.325153 0.325153 -1.000000 -1.000000 0.407558 0.407558 0.407558 -0.459459 -0.459459 0.286486 0.286486 0.286486 ... -0.636364 -0.636364 0.272727 0.272727 0.272727 -0.593220 -0.593220 -0.317859 -0.317859 -0.317859 -0.586207 -0.586207 -0.379310 -0.379310 -0.379310 -0.500000 -0.500000 -0.857143 -0.857143 -0.857143 -1.000000 -1.000000 -1.000000 -1.000000 -0.238095 -1.000000 -1.000000 -1.000000 -1.000000 -0.242282 -0.420290 0.246377 -0.275362 -0.275362 0.362319 -0.285714 0.535714 -0.107143 -0.107143 0.033779 -0.285714 0.535714 -0.107143 -0.107143 0.035714 0.208791 0.714286 0.318681 0.318681 0.098901
1 1 90th 1 1.0 1 -1.000000 -1.000000 -1.000000 -0.913043 0.513043 -1.000000 -1.000000 -1.000000 -0.906832 0.315690 -0.504274 -0.589744 -0.589744 -0.572650 0.555556 -0.283951 -0.407407 -0.407407 -0.456790 -0.178122 -0.283951 -0.407407 -0.407407 -0.506173 -0.185185 -0.072165 -0.175258 -0.175258 -0.257732 -0.587629 -1.000000 -1.000000 -1.000000 -0.754601 0.472393 -1.000000 -1.000000 -1.000000 -0.831132 0.200359 -0.329730 0.329730 0.124324 0.167568 0.556757 ... 0.636364 -0.454545 -0.515152 -0.090909 0.757576 0.830508 -0.389831 -0.457627 -0.145763 -0.380216 0.862069 -0.379310 -0.448276 -0.103448 -0.379310 1.000000 -0.285714 -0.357143 -0.142857 -1.000000 -1.000000 -1.000000 -1.000000 -0.880952 0.142857 -1.000000 -1.000000 -1.000000 -0.882574 0.139709 -0.275362 -0.275362 -0.217391 -0.014493 0.710145 -0.107143 -0.107143 -0.035714 0.114286 0.010915 -0.107143 -0.107143 -0.035714 0.142857 0.000000 0.318681 0.318681 0.362637 0.406593 0.010989
2 0 10th 0 0.0 2 -0.547826 -0.547826 -0.547826 -0.704348 -0.547826 -0.515528 -0.515528 -0.515528 -0.658863 -0.612422 -0.435897 -0.435897 -0.435897 -0.572650 -0.247863 -0.489712 -0.489712 -0.489712 -0.612654 -0.181070 -0.506173 -0.506173 -0.506173 -0.604938 -0.160494 -0.525773 -0.525773 -0.525773 -0.505155 -0.298969 -0.533742 -0.533742 -0.533742 -0.693252 -0.435583 -0.351328 -0.351328 -0.351328 -0.563758 -0.343258 -0.491892 -0.491892 -0.491892 -0.762162 -0.351351 ... -0.575758 -0.575758 -0.575758 -0.696970 -0.454545 -0.645951 -0.645951 -0.645951 -0.720339 -0.543785 -0.517241 -0.517241 -0.517241 -0.724138 -0.517241 -0.714286 -0.714286 -0.714286 -0.642857 -0.642857 -1.000000 -1.000000 -1.000000 -1.000000 -0.500000 -1.000000 -1.000000 -1.000000 -1.000000 -0.505464 0.101449 0.101449 0.101449 0.101449 0.101449 0.357143 0.357143 0.357143 0.357143 0.057292 0.357143 0.357143 0.357143 0.357143 0.107143 0.604396 0.604396 0.604396 0.604396 0.142857
3 0 40th 1 0.0 3 -1.000000 -1.000000 -1.000000 -1.000000 -0.078261 -1.000000 -1.000000 -1.000000 -1.000000 -0.308696 -0.299145 -0.299145 -0.299145 -0.589744 0.145299 0.012346 0.012346 0.012346 -0.407407 -0.002798 0.012346 0.012346 0.012346 -0.407407 0.086420 0.175258 0.175258 0.175258 -0.175258 -0.381443 -1.000000 -1.000000 -1.000000 -1.000000 -0.190184 -1.000000 -1.000000 -1.000000 -1.000000 -0.057718 -0.556757 -0.556757 -0.556757 -0.675676 -0.286486 ... -0.515152 -0.515152 -0.515152 -0.575758 -0.272727 -0.457627 -0.457627 -0.457627 -0.525424 -0.535593 -0.448276 -0.448276 -0.448276 -0.517241 -0.517241 -0.357143 -0.357143 -0.357143 -0.428571 -0.857143 -1.000000 -1.000000 -1.000000 -1.000000 -0.047619 -1.000000 -1.000000 -1.000000 -1.000000 -0.047619 -0.420290 -0.420290 -0.420290 -0.275362 0.623188 -0.285714 -0.285714 -0.285714 -0.107143 0.033571 -0.285714 -0.285714 -0.285714 -0.107143 -0.035714 0.208791 0.208791 0.208791 0.318681 0.120879
4 0 10th 0 0.0 4 -1.000000 -1.000000 -1.000000 -1.000000 -0.478261 -1.000000 -1.000000 -1.000000 -1.000000 -0.652174 -0.076923 -0.076923 -0.076923 -0.247863 0.094017 0.333333 0.333333 0.333333 0.086420 0.290762 0.333333 0.333333 0.333333 0.086420 0.333333 0.443299 0.443299 0.443299 0.237113 0.030928 -0.877301 -0.877301 -0.877301 -1.000000 -0.558282 -0.883669 -0.883669 -0.883669 -1.000000 -0.596165 -0.351351 -0.351351 -0.351351 -0.351351 -0.178378 ... -0.575758 -0.575758 -0.575758 -0.575758 -0.454545 -0.593220 -0.593220 -0.593220 -0.525424 -0.499708 -0.586207 -0.586207 -0.586207 -0.517241 -0.517241 -0.571429 -0.571429 -0.571429 -0.428571 -0.500000 -0.952381 -0.952381 -0.952381 -0.690476 -0.642857 -0.953536 -0.953536 -0.953536 -0.698797 -0.645793 0.072464 0.072464 0.072464 0.130435 0.014493 0.285714 0.285714 0.285714 0.241071 0.040640 0.285714 0.285714 0.285714 0.321429 0.071429 0.538462 0.538462 0.538462 0.340659 0.208791

5 rows × 230 columns

We want to explore the dataset, but there is one issue to resolve first. Since we are building a model to predict whether a patient will require ICU care, the data gathered after ICU admission must not be used, as instructed by the dataset providers. Because this data will not be used to model ICU admission, there is no point exploring it either. Let's remove it.

In [39]:
#Define function to remove data measured after ICU admission
def remove_ICU_data(data):
    
    df_list = []
    for n in range(1, 6):
        cols = [x for x in data.columns.values if int(x[-1]) == n]
        ICU_col = 'ICU_' + str(n)
        df_list.append(data[data[ICU_col] == 0][cols].drop(columns = ICU_col))
        
    return pd.concat(df_list, axis = 1)
In [40]:
#Remove data after ICU admission
data = remove_ICU_data(data)

Let's now have another look at our data.

In [41]:
data.shape
Out[41]:
(352, 225)
In [42]:
data
Out[42]:
AGE_ABOVE65_1 AGE_PERCENTIL_1 GENDER_1 HTN_1 PATIENT_VISIT_IDENTIFIER_1 BLOODPRESSURE_DIASTOLIC_DIFF_1 BLOODPRESSURE_DIASTOLIC_DIFF_REL_1 BLOODPRESSURE_DIASTOLIC_MAX_1 BLOODPRESSURE_DIASTOLIC_MEAN_1 BLOODPRESSURE_DIASTOLIC_MEDIAN_1 BLOODPRESSURE_DIASTOLIC_MIN_1 BLOODPRESSURE_SISTOLIC_DIFF_1 BLOODPRESSURE_SISTOLIC_DIFF_REL_1 BLOODPRESSURE_SISTOLIC_MAX_1 BLOODPRESSURE_SISTOLIC_MEAN_1 BLOODPRESSURE_SISTOLIC_MEDIAN_1 BLOODPRESSURE_SISTOLIC_MIN_1 DISEASE GROUPING 1_1 DISEASE GROUPING 2_1 DISEASE GROUPING 3_1 DISEASE GROUPING 4_1 DISEASE GROUPING 5_1 DISEASE GROUPING 6_1 HEART_RATE_DIFF_1 HEART_RATE_DIFF_REL_1 HEART_RATE_MAX_1 HEART_RATE_MEAN_1 HEART_RATE_MEDIAN_1 HEART_RATE_MIN_1 IMMUNOCOMPROMISED_1 OTHER_1 OXYGEN_SATURATION_DIFF_1 OXYGEN_SATURATION_DIFF_REL_1 OXYGEN_SATURATION_MAX_1 OXYGEN_SATURATION_MEAN_1 OXYGEN_SATURATION_MEDIAN_1 OXYGEN_SATURATION_MIN_1 RESPIRATORY_RATE_DIFF_1 RESPIRATORY_RATE_DIFF_REL_1 RESPIRATORY_RATE_MAX_1 RESPIRATORY_RATE_MEAN_1 RESPIRATORY_RATE_MEDIAN_1 RESPIRATORY_RATE_MIN_1 TEMPERATURE_DIFF_1 TEMPERATURE_DIFF_REL_1 TEMPERATURE_MAX_1 TEMPERATURE_MEAN_1 TEMPERATURE_MEDIAN_1 TEMPERATURE_MIN_1 BLOODPRESSURE_DIASTOLIC_DIFF_2 ... 
TEMPERATURE_DIFF_4 TEMPERATURE_DIFF_REL_4 TEMPERATURE_MAX_4 TEMPERATURE_MEAN_4 TEMPERATURE_MEDIAN_4 TEMPERATURE_MIN_4 BLOODPRESSURE_DIASTOLIC_DIFF_5 BLOODPRESSURE_DIASTOLIC_DIFF_REL_5 BLOODPRESSURE_DIASTOLIC_MAX_5 BLOODPRESSURE_DIASTOLIC_MEAN_5 BLOODPRESSURE_DIASTOLIC_MEDIAN_5 BLOODPRESSURE_DIASTOLIC_MIN_5 BLOODPRESSURE_SISTOLIC_DIFF_5 BLOODPRESSURE_SISTOLIC_DIFF_REL_5 BLOODPRESSURE_SISTOLIC_MAX_5 BLOODPRESSURE_SISTOLIC_MEAN_5 BLOODPRESSURE_SISTOLIC_MEDIAN_5 BLOODPRESSURE_SISTOLIC_MIN_5 DISEASE GROUPING 1_5 DISEASE GROUPING 2_5 DISEASE GROUPING 3_5 DISEASE GROUPING 4_5 DISEASE GROUPING 5_5 DISEASE GROUPING 6_5 HEART_RATE_DIFF_5 HEART_RATE_DIFF_REL_5 HEART_RATE_MAX_5 HEART_RATE_MEAN_5 HEART_RATE_MEDIAN_5 HEART_RATE_MIN_5 IMMUNOCOMPROMISED_5 OTHER_5 OXYGEN_SATURATION_DIFF_5 OXYGEN_SATURATION_DIFF_REL_5 OXYGEN_SATURATION_MAX_5 OXYGEN_SATURATION_MEAN_5 OXYGEN_SATURATION_MEDIAN_5 OXYGEN_SATURATION_MIN_5 RESPIRATORY_RATE_DIFF_5 RESPIRATORY_RATE_DIFF_REL_5 RESPIRATORY_RATE_MAX_5 RESPIRATORY_RATE_MEAN_5 RESPIRATORY_RATE_MEDIAN_5 RESPIRATORY_RATE_MIN_5 TEMPERATURE_DIFF_5 TEMPERATURE_DIFF_REL_5 TEMPERATURE_MAX_5 TEMPERATURE_MEAN_5 TEMPERATURE_MEDIAN_5 TEMPERATURE_MIN_5
0 1 60th 0 0.0 0 -1.000000 -1.000000 -0.247863 0.086420 0.086420 0.237113 -1.000000 -1.000000 -0.459459 -0.230769 -0.230769 0.0000 0.0 0.0 0.0 0.0 1.0 1.0 -1.000000 -1.000000 -0.432836 -0.283019 -0.283019 -0.162393 0.0 1.0 -1.000000 -1.000000 0.736842 0.736842 0.736842 0.898990 -1.000000 -1.000000 -0.636364 -0.593220 -0.586207 -0.500000 -1.000000 -1.000000 -0.420290 -0.285714 -0.285714 0.208791 -1.000000 ... -1.000000 -1.000000 -0.275362 -0.107143 -0.107143 0.318681 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 0 10th 0 0.0 2 -0.547826 -0.515528 -0.435897 -0.489712 -0.506173 -0.525773 -0.533742 -0.351328 -0.491892 -0.685470 -0.815385 -0.5125 0.0 0.0 0.0 0.0 0.0 0.0 -0.603053 -0.747001 0.000000 -0.048218 -0.056604 -0.111111 0.0 1.0 -0.959596 -0.961262 1.000000 0.935673 0.947368 0.959596 -0.764706 -0.756272 -0.575758 -0.645951 -0.517241 -0.714286 -1.000000 -1.000000 0.101449 0.357143 0.357143 0.604396 -0.547826 ... -1.000000 -1.000000 0.101449 0.357143 0.357143 0.604396 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 0 40th 1 0.0 3 -1.000000 -1.000000 -0.299145 0.012346 0.012346 0.175258 -1.000000 -1.000000 -0.556757 -0.369231 -0.369231 -0.1125 0.0 0.0 0.0 0.0 0.0 0.0 -1.000000 -1.000000 -0.626866 -0.528302 -0.528302 -0.384615 1.0 1.0 -1.000000 -1.000000 0.684211 0.684211 0.684211 0.878788 -1.000000 -1.000000 -0.515152 -0.457627 -0.448276 -0.357143 -1.000000 -1.000000 -0.420290 -0.285714 -0.285714 0.208791 -1.000000 ... -1.000000 -1.000000 -0.275362 -0.107143 -0.107143 0.318681 -0.078261 -0.308696 0.145299 -0.002798 0.086420 -0.381443 -0.190184 -0.057718 -0.286486 -0.546256 -0.538462 -0.6250 0.0 0.0 0.0 0.0 0.0 0.0 0.251908 -0.069094 0.477612 -0.270189 -0.301887 -0.521368 1.0 1.0 -0.171717 -0.172436 1.000000 0.694035 0.736842 0.171717 -0.352941 -0.329749 -0.272727 -0.535593 -0.517241 -0.857143 -0.047619 -0.047619 0.623188 0.033571 -0.035714 0.120879
4 0 10th 0 0.0 4 -1.000000 -1.000000 -0.076923 0.333333 0.333333 0.443299 -0.877301 -0.883669 -0.351351 -0.153846 -0.153846 0.0000 0.0 0.0 0.0 0.0 0.0 0.0 -0.923664 -0.956805 -0.044776 0.160377 0.160377 0.196581 0.0 1.0 -0.979798 -0.980333 0.894737 0.868421 0.868421 0.939394 -0.882353 -0.870968 -0.575758 -0.593220 -0.586207 -0.571429 -0.952381 -0.953536 0.072464 0.285714 0.285714 0.538462 -1.000000 ... -0.690476 -0.698797 0.130435 0.241071 0.321429 0.340659 -0.478261 -0.652174 0.094017 0.290762 0.333333 0.030928 -0.558282 -0.596165 -0.178378 -0.074271 -0.076923 -0.1250 0.0 0.0 0.0 0.0 0.0 0.0 -0.389313 -0.634847 0.104478 0.051399 0.056604 -0.230769 0.0 1.0 -0.939394 -0.940077 0.894737 0.820327 0.789474 0.898990 -0.823529 -0.817204 -0.454545 -0.499708 -0.517241 -0.500000 -0.642857 -0.645793 0.014493 0.040640 0.071429 0.208791
5 0 10th 0 0.0 5 -0.826087 -0.860870 -0.247863 -0.037037 -0.037037 0.030928 -0.754601 -0.714460 -0.567568 -0.538462 -0.538462 -0.3750 0.0 0.0 0.0 0.0 0.0 0.0 -0.984733 -0.986481 -0.626866 -0.537736 -0.537736 -0.401709 0.0 1.0 -0.979798 -0.980129 0.842105 0.815789 0.815789 0.919192 -1.000000 -1.000000 -0.575758 -0.525424 -0.517241 -0.428571 -0.976190 -0.975891 -0.333333 -0.196429 -0.196429 0.252747 -0.826087 ... -1.000000 -1.000000 -0.275362 -0.107143 -0.107143 0.318681 -0.704348 -0.758651 -0.179487 -0.037037 -0.074074 -0.030928 -0.705521 -0.683267 -0.416216 -0.406838 -0.400000 -0.2500 0.0 0.0 0.0 0.0 0.0 0.0 -0.480916 -0.581849 -0.298507 -0.428721 -0.415094 -0.589744 0.0 1.0 -0.919192 -0.920927 1.000000 0.847953 0.842105 0.919192 -0.941176 -0.939068 -0.515152 -0.502825 -0.517241 -0.428571 -0.738095 -0.736640 -0.130435 -0.109127 -0.107143 0.186813
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
380 0 40th 1 0.0 380 -1.000000 -1.000000 -0.418803 -0.160494 -0.160494 0.030928 -1.000000 -1.000000 -0.783784 -0.692308 -0.692308 -0.3750 0.0 0.0 0.0 0.0 0.0 0.0 -1.000000 -1.000000 0.059701 0.339623 0.339623 0.401709 1.0 1.0 -1.000000 -1.000000 0.736842 0.736842 0.736842 0.898990 -1.000000 -1.000000 -0.515152 -0.457627 -0.448276 -0.357143 -1.000000 -1.000000 -0.072464 0.142857 0.142857 0.472527 -1.000000 ... -1.000000 -1.000000 -0.188406 0.000000 0.000000 0.384615 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
381 1 Above 90th 0 0.0 381 -1.000000 -1.000000 -0.589744 -0.407407 -0.407407 -0.175258 -1.000000 -1.000000 -0.783784 -0.692308 -0.692308 -0.3750 0.0 0.0 0.0 0.0 0.0 0.0 -1.000000 -1.000000 -0.432836 -0.283019 -0.283019 -0.162393 1.0 1.0 -1.000000 -1.000000 0.526316 0.526316 0.526316 0.818182 -1.000000 -1.000000 -0.515152 -0.457627 -0.448276 -0.357143 -0.619048 -0.612627 0.072464 -0.059524 -0.250000 0.230769 -1.000000 ... -0.619048 -0.612627 0.072464 -0.059524 -0.250000 0.230769 -0.982609 -0.982609 -0.572650 -0.399177 -0.407407 -0.175258 -0.889571 -0.871507 -0.675676 -0.584615 -0.538462 -0.3625 1.0 0.0 0.0 0.0 0.0 0.0 -0.770992 -0.804670 -0.373134 -0.396226 -0.490566 -0.350427 1.0 1.0 -0.959596 -0.960052 0.789474 0.754386 0.789474 0.878788 -0.882353 -0.878136 -0.575758 -0.570621 -0.517241 -0.571429 -0.690476 -0.697169 0.188406 0.238095 0.250000 0.384615
382 0 50th 0 0.0 382 -1.000000 -1.000000 -0.299145 0.012346 0.012346 0.175258 -1.000000 -1.000000 -0.567568 -0.384615 -0.384615 -0.1250 0.0 0.0 0.0 0.0 0.0 0.0 -1.000000 -1.000000 -0.462687 -0.320755 -0.320755 -0.196581 0.0 0.0 -1.000000 -1.000000 0.894737 0.894737 0.894737 0.959596 -1.000000 -1.000000 -0.515152 -0.457627 -0.448276 -0.357143 -1.000000 -1.000000 -0.246377 -0.071429 -0.071429 0.340659 -1.000000 ... -1.000000 -1.000000 -0.304348 -0.142857 -0.142857 0.296703 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
383 0 40th 1 0.0 383 -1.000000 -1.000000 -0.247863 0.086420 0.086420 0.237113 -1.000000 -1.000000 -0.459459 -0.230769 -0.230769 0.0000 0.0 0.0 0.0 0.0 0.0 0.0 -1.000000 -1.000000 -0.447761 -0.301887 -0.301887 -0.179487 0.0 0.0 -1.000000 -1.000000 0.736842 0.736842 0.736842 0.898990 -1.000000 -1.000000 -0.696970 -0.661017 -0.655172 -0.571429 -1.000000 -1.000000 -0.275362 -0.107143 -0.107143 0.318681 -1.000000 ... -1.000000 -1.000000 -0.275362 -0.107143 -0.107143 0.318681 -0.478261 -0.552795 -0.076923 -0.083298 -0.160494 -0.175258 -0.644172 -0.585967 -0.470270 -0.478691 -0.538462 -0.3750 0.0 0.0 0.0 0.0 0.0 0.0 -0.358779 -0.557252 -0.029851 -0.190414 -0.188679 -0.418803 0.0 1.0 -0.838384 -0.838524 0.894737 0.705989 0.736842 0.797980 -0.588235 -0.573477 -0.393939 -0.541009 -0.517241 -0.714286 -0.571429 -0.572609 0.043478 0.036125 0.000000 0.164835
384 0 50th 1 0.0 384 -1.000000 -1.000000 -0.299145 0.012346 0.012346 0.175258 -1.000000 -1.000000 -0.502703 -0.292308 -0.292308 -0.0500 0.0 0.0 0.0 0.0 0.0 0.0 -1.000000 -1.000000 -0.164179 0.056604 0.056604 0.145299 0.0 1.0 -1.000000 -1.000000 0.789474 0.789474 0.789474 0.919192 -1.000000 -1.000000 -0.575758 -0.525424 -0.517241 -0.428571 -1.000000 -1.000000 0.246377 0.535714 0.535714 0.714286 -1.000000 ... -1.000000 -1.000000 0.043478 0.285714 0.285714 0.560440 -0.652174 -0.701863 -0.247863 -0.185185 -0.160494 -0.175258 -0.644172 -0.585967 -0.470270 -0.539103 -0.538462 -0.3750 0.0 0.0 1.0 0.0 0.0 0.0 -0.633588 -0.763868 -0.149254 -0.107704 -0.075472 -0.247863 0.0 1.0 -0.838384 -0.835052 0.842105 0.662281 0.631579 0.777778 -0.647059 -0.612903 -0.515152 -0.610169 -0.586207 -0.785714 -0.547619 -0.551337 0.101449 0.050595 0.071429 0.186813

352 rows × 225 columns

Observation:

  • The first change is easy to explain: the column count dropped by exactly five, corresponding to the five ICU_n columns removed by our function.
  • The smaller number of rows (352 versus 384, i.e., 32 patients) indicates that some patients were admitted directly to the ICU in the first window. Their records were removed since they are of no use for the task at hand.

Consolidate Data¶

One of the first things we did when starting this assignment was to aggregate the admission data to determine whether each patient ultimately required ICU care. Now that our data is in the format we want, it is a good time to join everything together.

In [43]:
#Join data together
data = data.join(other = admission_data.set_index('PATIENT_VISIT_IDENTIFIER').drop(columns = 'WINDOW'),
                 on = 'PATIENT_VISIT_IDENTIFIER_1',
                 how = 'inner')

data.head()
Out[43]:
AGE_ABOVE65_1 AGE_PERCENTIL_1 GENDER_1 HTN_1 PATIENT_VISIT_IDENTIFIER_1 BLOODPRESSURE_DIASTOLIC_DIFF_1 BLOODPRESSURE_DIASTOLIC_DIFF_REL_1 BLOODPRESSURE_DIASTOLIC_MAX_1 BLOODPRESSURE_DIASTOLIC_MEAN_1 BLOODPRESSURE_DIASTOLIC_MEDIAN_1 BLOODPRESSURE_DIASTOLIC_MIN_1 BLOODPRESSURE_SISTOLIC_DIFF_1 BLOODPRESSURE_SISTOLIC_DIFF_REL_1 BLOODPRESSURE_SISTOLIC_MAX_1 BLOODPRESSURE_SISTOLIC_MEAN_1 BLOODPRESSURE_SISTOLIC_MEDIAN_1 BLOODPRESSURE_SISTOLIC_MIN_1 DISEASE GROUPING 1_1 DISEASE GROUPING 2_1 DISEASE GROUPING 3_1 DISEASE GROUPING 4_1 DISEASE GROUPING 5_1 DISEASE GROUPING 6_1 HEART_RATE_DIFF_1 HEART_RATE_DIFF_REL_1 HEART_RATE_MAX_1 HEART_RATE_MEAN_1 HEART_RATE_MEDIAN_1 HEART_RATE_MIN_1 IMMUNOCOMPROMISED_1 OTHER_1 OXYGEN_SATURATION_DIFF_1 OXYGEN_SATURATION_DIFF_REL_1 OXYGEN_SATURATION_MAX_1 OXYGEN_SATURATION_MEAN_1 OXYGEN_SATURATION_MEDIAN_1 OXYGEN_SATURATION_MIN_1 RESPIRATORY_RATE_DIFF_1 RESPIRATORY_RATE_DIFF_REL_1 RESPIRATORY_RATE_MAX_1 RESPIRATORY_RATE_MEAN_1 RESPIRATORY_RATE_MEDIAN_1 RESPIRATORY_RATE_MIN_1 TEMPERATURE_DIFF_1 TEMPERATURE_DIFF_REL_1 TEMPERATURE_MAX_1 TEMPERATURE_MEAN_1 TEMPERATURE_MEDIAN_1 TEMPERATURE_MIN_1 BLOODPRESSURE_DIASTOLIC_DIFF_2 ... 
TEMPERATURE_DIFF_REL_4 TEMPERATURE_MAX_4 TEMPERATURE_MEAN_4 TEMPERATURE_MEDIAN_4 TEMPERATURE_MIN_4 BLOODPRESSURE_DIASTOLIC_DIFF_5 BLOODPRESSURE_DIASTOLIC_DIFF_REL_5 BLOODPRESSURE_DIASTOLIC_MAX_5 BLOODPRESSURE_DIASTOLIC_MEAN_5 BLOODPRESSURE_DIASTOLIC_MEDIAN_5 BLOODPRESSURE_DIASTOLIC_MIN_5 BLOODPRESSURE_SISTOLIC_DIFF_5 BLOODPRESSURE_SISTOLIC_DIFF_REL_5 BLOODPRESSURE_SISTOLIC_MAX_5 BLOODPRESSURE_SISTOLIC_MEAN_5 BLOODPRESSURE_SISTOLIC_MEDIAN_5 BLOODPRESSURE_SISTOLIC_MIN_5 DISEASE GROUPING 1_5 DISEASE GROUPING 2_5 DISEASE GROUPING 3_5 DISEASE GROUPING 4_5 DISEASE GROUPING 5_5 DISEASE GROUPING 6_5 HEART_RATE_DIFF_5 HEART_RATE_DIFF_REL_5 HEART_RATE_MAX_5 HEART_RATE_MEAN_5 HEART_RATE_MEDIAN_5 HEART_RATE_MIN_5 IMMUNOCOMPROMISED_5 OTHER_5 OXYGEN_SATURATION_DIFF_5 OXYGEN_SATURATION_DIFF_REL_5 OXYGEN_SATURATION_MAX_5 OXYGEN_SATURATION_MEAN_5 OXYGEN_SATURATION_MEDIAN_5 OXYGEN_SATURATION_MIN_5 RESPIRATORY_RATE_DIFF_5 RESPIRATORY_RATE_DIFF_REL_5 RESPIRATORY_RATE_MAX_5 RESPIRATORY_RATE_MEAN_5 RESPIRATORY_RATE_MEDIAN_5 RESPIRATORY_RATE_MIN_5 TEMPERATURE_DIFF_5 TEMPERATURE_DIFF_REL_5 TEMPERATURE_MAX_5 TEMPERATURE_MEAN_5 TEMPERATURE_MEDIAN_5 TEMPERATURE_MIN_5 ICU
0 1 60th 0 0.0 0 -1.000000 -1.000000 -0.247863 0.086420 0.086420 0.237113 -1.000000 -1.000000 -0.459459 -0.230769 -0.230769 0.0000 0.0 0.0 0.0 0.0 1.0 1.0 -1.000000 -1.000000 -0.432836 -0.283019 -0.283019 -0.162393 0.0 1.0 -1.000000 -1.000000 0.736842 0.736842 0.736842 0.898990 -1.000000 -1.000000 -0.636364 -0.593220 -0.586207 -0.500000 -1.000000 -1.000000 -0.420290 -0.285714 -0.285714 0.208791 -1.000000 ... -1.000000 -0.275362 -0.107143 -0.107143 0.318681 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1
2 0 10th 0 0.0 2 -0.547826 -0.515528 -0.435897 -0.489712 -0.506173 -0.525773 -0.533742 -0.351328 -0.491892 -0.685470 -0.815385 -0.5125 0.0 0.0 0.0 0.0 0.0 0.0 -0.603053 -0.747001 0.000000 -0.048218 -0.056604 -0.111111 0.0 1.0 -0.959596 -0.961262 1.000000 0.935673 0.947368 0.959596 -0.764706 -0.756272 -0.575758 -0.645951 -0.517241 -0.714286 -1.000000 -1.000000 0.101449 0.357143 0.357143 0.604396 -0.547826 ... -1.000000 0.101449 0.357143 0.357143 0.604396 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1
3 0 40th 1 0.0 3 -1.000000 -1.000000 -0.299145 0.012346 0.012346 0.175258 -1.000000 -1.000000 -0.556757 -0.369231 -0.369231 -0.1125 0.0 0.0 0.0 0.0 0.0 0.0 -1.000000 -1.000000 -0.626866 -0.528302 -0.528302 -0.384615 1.0 1.0 -1.000000 -1.000000 0.684211 0.684211 0.684211 0.878788 -1.000000 -1.000000 -0.515152 -0.457627 -0.448276 -0.357143 -1.000000 -1.000000 -0.420290 -0.285714 -0.285714 0.208791 -1.000000 ... -1.000000 -0.275362 -0.107143 -0.107143 0.318681 -0.078261 -0.308696 0.145299 -0.002798 0.086420 -0.381443 -0.190184 -0.057718 -0.286486 -0.546256 -0.538462 -0.625 0.0 0.0 0.0 0.0 0.0 0.0 0.251908 -0.069094 0.477612 -0.270189 -0.301887 -0.521368 1.0 1.0 -0.171717 -0.172436 1.000000 0.694035 0.736842 0.171717 -0.352941 -0.329749 -0.272727 -0.535593 -0.517241 -0.857143 -0.047619 -0.047619 0.623188 0.033571 -0.035714 0.120879 0
4 0 10th 0 0.0 4 -1.000000 -1.000000 -0.076923 0.333333 0.333333 0.443299 -0.877301 -0.883669 -0.351351 -0.153846 -0.153846 0.0000 0.0 0.0 0.0 0.0 0.0 0.0 -0.923664 -0.956805 -0.044776 0.160377 0.160377 0.196581 0.0 1.0 -0.979798 -0.980333 0.894737 0.868421 0.868421 0.939394 -0.882353 -0.870968 -0.575758 -0.593220 -0.586207 -0.571429 -0.952381 -0.953536 0.072464 0.285714 0.285714 0.538462 -1.000000 ... -0.698797 0.130435 0.241071 0.321429 0.340659 -0.478261 -0.652174 0.094017 0.290762 0.333333 0.030928 -0.558282 -0.596165 -0.178378 -0.074271 -0.076923 -0.125 0.0 0.0 0.0 0.0 0.0 0.0 -0.389313 -0.634847 0.104478 0.051399 0.056604 -0.230769 0.0 1.0 -0.939394 -0.940077 0.894737 0.820327 0.789474 0.898990 -0.823529 -0.817204 -0.454545 -0.499708 -0.517241 -0.500000 -0.642857 -0.645793 0.014493 0.040640 0.071429 0.208791 0
5 0 10th 0 0.0 5 -0.826087 -0.860870 -0.247863 -0.037037 -0.037037 0.030928 -0.754601 -0.714460 -0.567568 -0.538462 -0.538462 -0.3750 0.0 0.0 0.0 0.0 0.0 0.0 -0.984733 -0.986481 -0.626866 -0.537736 -0.537736 -0.401709 0.0 1.0 -0.979798 -0.980129 0.842105 0.815789 0.815789 0.919192 -1.000000 -1.000000 -0.575758 -0.525424 -0.517241 -0.428571 -0.976190 -0.975891 -0.333333 -0.196429 -0.196429 0.252747 -0.826087 ... -1.000000 -0.275362 -0.107143 -0.107143 0.318681 -0.704348 -0.758651 -0.179487 -0.037037 -0.074074 -0.030928 -0.705521 -0.683267 -0.416216 -0.406838 -0.400000 -0.250 0.0 0.0 0.0 0.0 0.0 0.0 -0.480916 -0.581849 -0.298507 -0.428721 -0.415094 -0.589744 0.0 1.0 -0.919192 -0.920927 1.000000 0.847953 0.842105 0.919192 -0.941176 -0.939068 -0.515152 -0.502825 -0.517241 -0.428571 -0.738095 -0.736640 -0.130435 -0.109127 -0.107143 0.186813 0

5 rows × 226 columns

In [44]:
len(data)
Out[44]:
352

Observation:

  • This dataset contains 352 records, one per patient, after excluding patients who were already admitted to the ICU within the 0-2 hour window

Data Exploration, Univariate and Bivariate Exploration ¶

After all this effort, we will now examine our dataset once again. This time, though, we will use graphs to try to understand how these features behave on their own and in combination.

Patient-Constant Features¶

In [45]:
#Identify patient-constant features
col_groups = [x[:-2] for x in data.drop(columns = ['PATIENT_VISIT_IDENTIFIER_1', 'ICU']).columns.values]
col_groups = np.unique(col_groups , return_counts = True)

patient_constant_cols = [col_groups[0][x] for x in range(len(col_groups[0])) if col_groups[1][x] == 1]
patient_constant_cols
Out[45]:
['AGE_ABOVE65', 'AGE_PERCENTIL', 'GENDER', 'HTN']

Observation:

  • Patient constant features are features that contain the same value for a single patient across all time points, such as a patient's 'AGE_ABOVE65', 'AGE_PERCENTIL', 'GENDER', 'HTN'.
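A quick way to sanity-check this is to count distinct values per patient in long-format data: a feature is patient-constant exactly when every patient shows a single value. A minimal sketch on synthetic data (the toy frame and values are illustrative, not taken from this dataset):

```python
import pandas as pd

# Toy long-format data: one row per patient per time window (illustrative)
df = pd.DataFrame({
    'PATIENT_VISIT_IDENTIFIER': [0, 0, 0, 1, 1, 1],
    'AGE_ABOVE65':              [1, 1, 1, 0, 0, 0],              # constant per patient
    'HEART_RATE_MEAN':          [0.1, 0.2, 0.3, -0.4, -0.5, -0.6],  # varies over time
})

# A feature is patient-constant if every patient has exactly one distinct value
nunique = df.groupby('PATIENT_VISIT_IDENTIFIER').nunique()
constant_cols = nunique.columns[(nunique == 1).all()].tolist()
print(constant_cols)  # ['AGE_ABOVE65']
```

This complements the suffix-counting trick above: the suffix count finds columns that appear only once, while this check verifies the values really do not change within a patient.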
In [46]:
#AGE_ABOVE65 and AGE_PERCENTIL
fig, axis = plt.subplots(nrows = 1, ncols = 2, figsize = (14, 4))
sns.countplot(data['AGE_ABOVE65_1'], ax = axis[0])
sns.countplot(data['AGE_PERCENTIL_1'], ax = axis[1])
plt.show(fig)

Observations:

  • Patients with AGE_ABOVE65 = 1 (older than 65) account for over 150 records
In [47]:
#GENDER and HTN
fig, axis = plt.subplots(nrows = 1, ncols = 2, figsize = (14, 4))
sns.countplot(data['GENDER_1'], ax = axis[0])
sns.countplot(data['HTN_1'], ax = axis[1])
plt.show(fig)

Aside from HTN, all of these features are quite evenly distributed. Now let's examine how they directly relate to the target.

In [48]:
#Define function for a normalized stacked bar plot
def normalized_stacked_bars(data, col, target):

    bottom = [0 for x in data[col].unique()]
    for cls in data[target].unique():
        x_vals, y_vals = np.unique(data[data[target] == cls][col], return_counts = True)
        x_vals = [str(x) for x in x_vals] 
        y_vals = [x / y for x, y in zip(y_vals, np.unique(data[col], return_counts = True)[1])]
        
        plt.bar(x_vals, y_vals, bottom = bottom, color = np.random.rand(1,3))
        bottom = [x + y for x, y in zip(bottom, y_vals)]
    
    plt.legend(np.unique(data[target]), title = target)
    plt.title(col)
    
    return plt.show()
In [49]:
#AGE_ABOVE65
normalized_stacked_bars(data, 'AGE_ABOVE65_1', 'ICU')

Observation:

  • A noticeably higher proportion of patients above 65 were admitted to the ICU
In [50]:
#AGE_PERCENTIL
normalized_stacked_bars(data, 'AGE_PERCENTIL_1', 'ICU')
In [51]:
#GENDER
normalized_stacked_bars(data, 'GENDER_1', 'ICU')

observations:

  • A higher percentage of males were admitted to the ICU than females
In [52]:
#HTN
normalized_stacked_bars(data, 'HTN_1', 'ICU')

Observation:

  • There is clearly some relationship between the target and AGE_ABOVE65, GENDER, and HTN.
  • AGE_PERCENTIL shows an especially strong association with the target.

Time-Variant Features¶

Time-variant features are features that contain multiple values for the same patient, such as multiple lab test results for a single patient over time. We will address them in two ways. First, we will examine each measurement regardless of when it was obtained. In the section that follows, we will examine these features as time series.
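To make this structure concrete: each window's value lives in its own suffixed column (`_1` through `_5`), so a wide-to-long reshape turns those suffixes into an explicit time axis. A hedged sketch with made-up values (only two windows shown for brevity):

```python
import pandas as pd

# Wide format: one column per window, as in this dataset (values illustrative)
wide = pd.DataFrame({
    'PATIENT_VISIT_IDENTIFIER': [0, 1],
    'HEART_RATE_MEAN_1': [-0.28, -0.05],
    'HEART_RATE_MEAN_2': [-0.30, -0.01],
})

# Turn the _1/_2 suffixes into an explicit 'window' index level
long_df = pd.wide_to_long(wide, stubnames='HEART_RATE_MEAN',
                          i='PATIENT_VISIT_IDENTIFIER', j='window', sep='_')
print(long_df.reset_index())
```

The long form is what the time-series plots below are conceptually working with, even though the plotting code assembles it manually.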

First Approach: Investigate Entire Feature Groups¶

Here, we aim to keep the analysis as simple as possible. Even working at the group level, there are still many feature groups to examine separately. One way out is to note that the groups can themselves be divided into categories.

We can cluster the feature groups in a few different ways; we will cluster by measurement type. For instance, the feature groups HEART_RATE_MAX and BLOODPRESSURE_DIASTOLIC_MAX both belong to the same MAX cluster.

You will notice that some features cannot be classified this way. We will examine those later with a more appropriate approach.

In [53]:
#Identifying time variant features and groups of features
patient_constant_cols = [x + '_1' for x in patient_constant_cols]

time_variant_cols = [x for x in data.columns.values if x not in patient_constant_cols]
time_variant_cols.remove('PATIENT_VISIT_IDENTIFIER_1')
time_variant_cols.remove('ICU')

time_variant_groups = np.unique([x[:-2] for x in time_variant_cols])
print(time_variant_groups)
['BLOODPRESSURE_DIASTOLIC_DIFF' 'BLOODPRESSURE_DIASTOLIC_DIFF_REL'
 'BLOODPRESSURE_DIASTOLIC_MAX' 'BLOODPRESSURE_DIASTOLIC_MEAN'
 'BLOODPRESSURE_DIASTOLIC_MEDIAN' 'BLOODPRESSURE_DIASTOLIC_MIN'
 'BLOODPRESSURE_SISTOLIC_DIFF' 'BLOODPRESSURE_SISTOLIC_DIFF_REL'
 'BLOODPRESSURE_SISTOLIC_MAX' 'BLOODPRESSURE_SISTOLIC_MEAN'
 'BLOODPRESSURE_SISTOLIC_MEDIAN' 'BLOODPRESSURE_SISTOLIC_MIN'
 'DISEASE GROUPING 1' 'DISEASE GROUPING 2' 'DISEASE GROUPING 3'
 'DISEASE GROUPING 4' 'DISEASE GROUPING 5' 'DISEASE GROUPING 6'
 'HEART_RATE_DIFF' 'HEART_RATE_DIFF_REL' 'HEART_RATE_MAX'
 'HEART_RATE_MEAN' 'HEART_RATE_MEDIAN' 'HEART_RATE_MIN'
 'IMMUNOCOMPROMISED' 'OTHER' 'OXYGEN_SATURATION_DIFF'
 'OXYGEN_SATURATION_DIFF_REL' 'OXYGEN_SATURATION_MAX'
 'OXYGEN_SATURATION_MEAN' 'OXYGEN_SATURATION_MEDIAN'
 'OXYGEN_SATURATION_MIN' 'RESPIRATORY_RATE_DIFF'
 'RESPIRATORY_RATE_DIFF_REL' 'RESPIRATORY_RATE_MAX'
 'RESPIRATORY_RATE_MEAN' 'RESPIRATORY_RATE_MEDIAN' 'RESPIRATORY_RATE_MIN'
 'TEMPERATURE_DIFF' 'TEMPERATURE_DIFF_REL' 'TEMPERATURE_MAX'
 'TEMPERATURE_MEAN' 'TEMPERATURE_MEDIAN' 'TEMPERATURE_MIN']
In [54]:
#Identify the largest clusters

Disease_group_cluster = ['DISEASE GROUPING 1', 'DISEASE GROUPING 2', 'DISEASE GROUPING 3', 
                        'DISEASE GROUPING 4', 'DISEASE GROUPING 5', 'DISEASE GROUPING 6', 
                        'IMMUNOCOMPROMISED', 'OTHER']

clusters = np.unique([x.split('_')[-1] for x in time_variant_groups if x not in Disease_group_cluster])
print(clusters)
['DIFF' 'MAX' 'MEAN' 'MEDIAN' 'MIN' 'REL']
In [55]:
#Define function to compile all values from a feature group
def extract_values_from_group(data, group_name):
    group_cols = [x for x in data.columns.values if x[:-2] == group_name]
    return data[group_cols].values.reshape(-1)
In [56]:
#Define function to plot all feature groups from a cluster
def plot_by_cluster(data, col_groups, cluster_name):
    #Identify groups to be plotted
    groups = [x for x in col_groups if x[-len(cluster_name):] == cluster_name]
    
    #Compute dimensions for subplots
    ncols = 2
    nrows = int(np.ceil(len(groups) / 2))

    #Plot groups
    fig, axis = plt.subplots(nrows = nrows, ncols = ncols, figsize = (15, 3*nrows))
    for i, group in enumerate(groups):
        row = int(i / 2)
        col = 0 if i%2 == 0 else 1
        if data[group + '_1'].dtype == np.int64:
            sns.countplot(extract_values_from_group(data, group), ax = axis[row, col]).set_title(group)
        else:
            sns.distplot(extract_values_from_group(data, group), ax = axis[row, col]).set_title(group)
    
    fig.tight_layout()
    return plt.show() 
In [57]:
#DIFF
plot_by_cluster(data, time_variant_groups, 'DIFF')

Observation

  • The feature groups in the DIFF cluster appear correlated with one another.
  • Most values are close to -1. This is not a good sign for whether these features can provide useful information to our predictive model, but we will not draw conclusions just yet.
In [58]:
#REL
plot_by_cluster(data, time_variant_groups, 'DIFF_REL')

Observations:

  • The DIFF_REL cluster exhibits essentially the same behaviour as the prior cluster.
  • This is understandable given that, as the name implies, these are essentially scaled versions of the feature groups displayed previously.

We will cluster the remaining features differently, according to the health factor being measured. Unlike the DIFF features, we do not expect the various measures to behave similarly.

In [59]:
#Identify remaining groups for plotting
reduced_time_variant_groups = \
[x for x in time_variant_groups if ('DIFF' not in x and x not in Disease_group_cluster)]

new_clusters = np.unique(['_'.join(x.split('_')[:-1]) for x in reduced_time_variant_groups])
new_clusters
Out[59]:
array(['BLOODPRESSURE_DIASTOLIC', 'BLOODPRESSURE_SISTOLIC', 'HEART_RATE',
       'OXYGEN_SATURATION', 'RESPIRATORY_RATE', 'TEMPERATURE'],
      dtype='<U23')
In [60]:
#Redefine function to plot all feature groups from a cluster
def plot_by_cluster(data, col_groups, cluster_name):
    #Identify groups to be plotted
    groups = [x for x in col_groups if x[:len(cluster_name)] == cluster_name]
    
    #Compute dimensions for subplots
    ncols = 2
    nrows = int(np.ceil(len(groups) / 2))

    #Plot groups
    fig, axis = plt.subplots(nrows = nrows, ncols = ncols, figsize = (15, 3*nrows))
    for i, group in enumerate(groups):
        row = int(i / 2)
        col = 0 if i%2 == 0 else 1
        if data[group + '_1'].dtype == np.int64:
            sns.countplot(extract_values_from_group(data, group), ax = axis[row, col]).set_title(group)
        else:
            sns.distplot(extract_values_from_group(data, group), ax = axis[row, col]).set_title(group)
    
    fig.tight_layout()
    return plt.show()
In [61]:
#BLOODPRESSURE_DIASTOLIC
plot_by_cluster(data, reduced_time_variant_groups, 'BLOODPRESSURE_DIASTOLIC')
In [62]:
#BLOODPRESSURE_SISTOLIC
plot_by_cluster(data, reduced_time_variant_groups, 'BLOODPRESSURE_SISTOLIC')
In [63]:
#OXYGEN_SATURATION
plot_by_cluster(data, reduced_time_variant_groups, 'OXYGEN_SATURATION')
In [64]:
#HEART_RATE
plot_by_cluster(data, reduced_time_variant_groups, 'HEART_RATE')
In [65]:
#RESPIRATORY_RATE
plot_by_cluster(data, reduced_time_variant_groups, 'RESPIRATORY_RATE')
In [66]:
#TEMPERATURE
plot_by_cluster(data, reduced_time_variant_groups, 'TEMPERATURE')

This new way of clustering the feature groups makes much more sense, as we had expected. At least when the time component is ignored, the measurements within each cluster clearly behave fairly similarly. This supports our suspicion that the values within a single feature cluster are probably correlated, potentially strongly. We now understand the behaviour of our data much better. The next step is to examine the impact of the time component on our analysis.
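The within-cluster correlation claim can be spot-checked numerically by averaging the absolute pairwise correlations among a cluster's columns. A minimal sketch on synthetic data (the column names and values are illustrative, not taken from this dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Synthetic 'cluster': three noisy variants of the same underlying signal
cluster = pd.DataFrame({
    'HEART_RATE_MAX':  base + rng.normal(scale=0.1, size=200),
    'HEART_RATE_MEAN': base + rng.normal(scale=0.1, size=200),
    'HEART_RATE_MIN':  base + rng.normal(scale=0.1, size=200),
})

corr = cluster.corr().abs()
# Mean of the off-diagonal entries of the correlation matrix
n = len(corr)
mean_offdiag = (corr.values.sum() - n) / (n * (n - 1))
print(round(mean_offdiag, 2))  # close to 1 for a tightly related cluster
```

A value near 1 for a real cluster would confirm the visual impression; a low value would suggest the cluster's features carry more independent information than the plots imply.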

Second Approach: Chronological¶

In [67]:
#Define function to plot feature group as series of boxplots
def plot_time_series(data, group, axs):
    
    x_vals = []
    y_vals = []
    for n in range(1,6):
        x_vals.extend([n for x in range(len(data))])
        y_vals.extend(data[group + '_' + str(n)])
        
    return sns.boxplot(x = x_vals, y = y_vals, ax = axs)
In [68]:
time_variant_groups
Out[68]:
array(['BLOODPRESSURE_DIASTOLIC_DIFF', 'BLOODPRESSURE_DIASTOLIC_DIFF_REL',
       'BLOODPRESSURE_DIASTOLIC_MAX', 'BLOODPRESSURE_DIASTOLIC_MEAN',
       'BLOODPRESSURE_DIASTOLIC_MEDIAN', 'BLOODPRESSURE_DIASTOLIC_MIN',
       'BLOODPRESSURE_SISTOLIC_DIFF', 'BLOODPRESSURE_SISTOLIC_DIFF_REL',
       'BLOODPRESSURE_SISTOLIC_MAX', 'BLOODPRESSURE_SISTOLIC_MEAN',
       'BLOODPRESSURE_SISTOLIC_MEDIAN', 'BLOODPRESSURE_SISTOLIC_MIN',
       'DISEASE GROUPING 1', 'DISEASE GROUPING 2', 'DISEASE GROUPING 3',
       'DISEASE GROUPING 4', 'DISEASE GROUPING 5', 'DISEASE GROUPING 6',
       'HEART_RATE_DIFF', 'HEART_RATE_DIFF_REL', 'HEART_RATE_MAX',
       'HEART_RATE_MEAN', 'HEART_RATE_MEDIAN', 'HEART_RATE_MIN',
       'IMMUNOCOMPROMISED', 'OTHER', 'OXYGEN_SATURATION_DIFF',
       'OXYGEN_SATURATION_DIFF_REL', 'OXYGEN_SATURATION_MAX',
       'OXYGEN_SATURATION_MEAN', 'OXYGEN_SATURATION_MEDIAN',
       'OXYGEN_SATURATION_MIN', 'RESPIRATORY_RATE_DIFF',
       'RESPIRATORY_RATE_DIFF_REL', 'RESPIRATORY_RATE_MAX',
       'RESPIRATORY_RATE_MEAN', 'RESPIRATORY_RATE_MEDIAN',
       'RESPIRATORY_RATE_MIN', 'TEMPERATURE_DIFF', 'TEMPERATURE_DIFF_REL',
       'TEMPERATURE_MAX', 'TEMPERATURE_MEAN', 'TEMPERATURE_MEDIAN',
       'TEMPERATURE_MIN'], dtype='<U32')
In [69]:
#Define function to plot the time series in clusters of feature groups
def plot_time_series_cluster(data, cluster_name):
    groups = [x for x in time_variant_groups if x[:len(cluster_name)] == cluster_name]
    
    ncols = 2
    nrows = int(np.ceil(len(groups) / 2))

    #Plot groups
    fig, axis = plt.subplots(nrows = nrows, ncols = ncols, figsize = (15, 3*nrows))
    for i, group in enumerate(groups):
        row = int(i / 2)
        col = 0 if i%2 == 0 else 1
        plot_time_series(data, group, axis[row, col]).set_title(group)
    
    fig.tight_layout()
    return plt.show()
In [70]:
#BLOODPRESSURE_DIASTOLIC
plot_time_series_cluster(data, 'BLOODPRESSURE_DIASTOLIC')

Observation:

  • For BLOODPRESSURE_DIASTOLIC, the DIFF feature groups show the highest variance over time. The other features largely follow the same behaviour from window 1 to window 4, with some notable changes in the last time step.
  • Outliers are another intriguing aspect: for the last four feature groups, they occur more frequently in the first two windows.
In [71]:
#BLOODPRESSURE_SISTOLIC
plot_time_series_cluster(data, 'BLOODPRESSURE_SISTOLIC')

Observation:

  • The BLOODPRESSURE_SISTOLIC features behave much like their BLOODPRESSURE_DIASTOLIC counterparts. The only notable variation is BLOODPRESSURE_SISTOLIC_MIN, which starts a gradual but clear decline at window 3.
In [72]:
#HEART_RATE
plot_time_series_cluster(data, 'HEART_RATE')

Observation:

  • Once more, the HEART_RATE features behave comparably to the earlier feature groups. Given that this trend appears throughout our data, it is worth investigating further what it signifies.
  • From the perspective of the problem at hand, the ideal situation is being able to determine whether a patient needs ICU admission from the window 1 measurements alone: the medical staff would be better prepared, and hospital resources could be used optimally. It would therefore be ideal if the window 1 features behaved cleanly and showed the fewest outliers. That is not the case, though.

  • Another crucial point is to avoid taking the information from the latest windows too seriously. This does not mean the data is inaccurate; the real issue is the quantity of information we have. Keep in mind that we are eliminating all data collected after ICU admission. For instance, only about 25% of the data in the most recent window is complete. While this does not necessarily mean the features should be ignored, we must be cautious when drawing inferences from samples of this size.
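The completeness point can be quantified on raw, pre-imputation data by measuring the fraction of fully observed rows per window. A sketch on a tiny synthetic frame (the `WINDOW` labels and measurement column are stand-ins for the raw dataset's columns):

```python
import numpy as np
import pandas as pd

# Synthetic raw data: later windows have more missing measurements
raw = pd.DataFrame({
    'WINDOW': ['0-2', '0-2', '2-4', '2-4',
               'ABOVE_12', 'ABOVE_12', 'ABOVE_12', 'ABOVE_12'],
    'HEART_RATE_MEAN': [0.1, 0.2, 0.3, np.nan,
                        np.nan, np.nan, np.nan, 0.4],
})

# Fraction of fully observed rows in each time window
completeness = raw.notna().all(axis=1).groupby(raw['WINDOW']).mean()
print(completeness)
```

On the real raw data, a computation like this is what backs the "about 25% complete in the last window" figure.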

In [73]:
#OXYGEN_SATURATION
plot_time_series_cluster(data, 'OXYGEN_SATURATION')

Observation:

  • These plots look quite different from the earlier feature groups. The data distribution for OXYGEN_SATURATION is substantially more constrained. Furthermore, despite the high number of outliers in the window 1 measurements, no window-related trend appears to exist.
In [74]:
#RESPIRATORY_RATE
plot_time_series_cluster(data, 'RESPIRATORY_RATE')

Observation:

  • The distributions of the RESPIRATORY_RATE features are quite constrained. Windows 1, 2, and 3 account for the majority of the outliers.
In [75]:
#TEMPERATURE
plot_time_series_cluster(data, 'TEMPERATURE')

Observation:

  • Up until window 3, the TEMPERATURE features appear to behave similarly to the BLOODPRESSURE features. After that point, the data distribution seems to undergo some important changes.

As we conclude this part, it is worth emphasising that exploratory research will not always produce startling revelations, and that is fine. What matters most here is the time spent getting to know the dataset. Much more exploratory analysis could, of course, be undertaken.

The Remaining Features¶

A few features have not yet been investigated. We will analyse them from the two prior viewpoints simultaneously. But first, let's review what they are.

In [76]:
Disease_group_cluster
Out[76]:
['DISEASE GROUPING 1',
 'DISEASE GROUPING 2',
 'DISEASE GROUPING 3',
 'DISEASE GROUPING 4',
 'DISEASE GROUPING 5',
 'DISEASE GROUPING 6',
 'IMMUNOCOMPROMISED',
 'OTHER']
In [77]:
#Redefine the plot_time_series function to use violinplots instead of boxplots
def violinplot_time_series(data, group, axs):
    
    x_vals = []
    y_vals = []
    for n in range(1,6):
        x_vals.extend([n for x in range(len(data))])
        y_vals.extend(data[group + '_' + str(n)])
        
    return sns.violinplot(x = x_vals, y = y_vals, ax = axs)
In [78]:
#IMMUNOCOMPROMISED
fig, axis = plt.subplots(nrows = 1, ncols = 2, figsize = (12,4))
sns.countplot(extract_values_from_group(data, 'IMMUNOCOMPROMISED'), ax = axis[0])
violinplot_time_series(data, 'IMMUNOCOMPROMISED', axs = axis[1])
Out[78]:
<AxesSubplot:>
In [79]:
#OTHER
fig, axis = plt.subplots(nrows = 1, ncols = 2, figsize = (12,4))
sns.countplot(extract_values_from_group(data, 'OTHER'), ax = axis[0])
violinplot_time_series(data, 'OTHER', axs = axis[1])
Out[79]:
<AxesSubplot:>
In [80]:
#DISEASE GROUPING 1
fig, axis = plt.subplots(nrows = 1, ncols = 2, figsize = (12,4))
sns.countplot(extract_values_from_group(data, 'DISEASE GROUPING 1'), ax = axis[0])
violinplot_time_series(data, 'DISEASE GROUPING 1', axs = axis[1])
Out[80]:
<AxesSubplot:>
In [81]:
#DISEASE GROUPING 2
fig, axis = plt.subplots(nrows = 1, ncols = 2, figsize = (12,4))
sns.countplot(extract_values_from_group(data, 'DISEASE GROUPING 2'), ax = axis[0])
violinplot_time_series(data, 'DISEASE GROUPING 2', axs = axis[1])
Out[81]:
<AxesSubplot:>
In [82]:
#DISEASE GROUPING 3
fig, axis = plt.subplots(nrows = 1, ncols = 2, figsize = (12,4))
sns.countplot(extract_values_from_group(data, 'DISEASE GROUPING 3'), ax = axis[0])
violinplot_time_series(data, 'DISEASE GROUPING 3', axs = axis[1])
Out[82]:
<AxesSubplot:>
In [83]:
#DISEASE GROUPING 4
fig, axis = plt.subplots(nrows = 1, ncols = 2, figsize = (12,4))
sns.countplot(extract_values_from_group(data, 'DISEASE GROUPING 4'), ax = axis[0])
violinplot_time_series(data, 'DISEASE GROUPING 4', axs = axis[1])
Out[83]:
<AxesSubplot:>
In [84]:
#DISEASE GROUPING 5
fig, axis = plt.subplots(nrows = 1, ncols = 2, figsize = (12,4))
sns.countplot(extract_values_from_group(data, 'DISEASE GROUPING 5'), ax = axis[0])
violinplot_time_series(data, 'DISEASE GROUPING 5', axs = axis[1])
Out[84]:
<AxesSubplot:>

Observation:

  • These features all share common characteristics: they are entirely binary and extremely skewed. Additionally, their distribution does not appear to vary significantly across time windows, especially if we assume that most of the changes in windows 4 and 5 result from an increase in missing values.

  • All of this suggests that these features are not genuinely time-variant. If that is the case, we should eliminate the redundant columns so they do not obstruct our efforts. Let's check whether this suspicion is correct.

In [85]:
#Define function to compute the percentage of records in which the feature value changes over time
def find_feature_value_change(data, group):
    group_cols = [x for x in data.columns if x[:len(group)] == group]
    summarized_data = data.groupby(by = 'PATIENT_VISIT_IDENTIFIER_1').max()[group_cols]
    data_change = summarized_data.max(axis = 1) - summarized_data.min(axis = 1)
    
    return len(data_change[data_change != 0]) / len(data)
In [86]:
#Compute feature time variance
time_change_df = pd.DataFrame(data = [100* find_feature_value_change(data, x) for x in Disease_group_cluster],
                              index = Disease_group_cluster,
                              columns = ['% of records with value change'])

time_change_df.sort_values(by = '% of records with value change', ascending = False)
Out[86]:
% of records with value change
OTHER 24.147727
DISEASE GROUPING 1 2.840909
DISEASE GROUPING 3 1.136364
DISEASE GROUPING 5 1.136364
DISEASE GROUPING 6 1.136364
IMMUNOCOMPROMISED 0.852273
DISEASE GROUPING 2 0.284091
DISEASE GROUPING 4 0.284091

Our hypothesis turned out to be mostly correct: with the exception of OTHER, all of these features show essentially no change over the time spans. Even so, the changes that do occur might be substantially associated with the target, so it is important to explore this possibility before making any decisions.

In [87]:
#Define function to show graphically the correlation between ICU admission and feature change over time
def feature_target_plot(data, group, target):
    group_cols = [x for x in data.columns if x[:len(group)] == group]
    summarized_data = data.groupby(by = 'PATIENT_VISIT_IDENTIFIER_1').max()[group_cols]
    data_change = summarized_data.max(axis = 1) - summarized_data.min(axis = 1)
    
    change_rows = data_change[data_change != 0].index
    no_change_rows = data_change[data_change == 0].index
    
    bot = [0, 0]
    for value in data[target].unique():
        y_vals = [len(data[data[target] == value].filter(change_rows, axis = 'index')) / len(change_rows),
                  len(data[data[target] == value].filter(no_change_rows, axis = 'index')) / len(no_change_rows)]
        x_vals = ['Change', 'No Change']
        
        plt.bar(x_vals, y_vals, bottom = bot, color = np.random.rand(1,3))
        bot = [x + y for x, y in zip(bot, y_vals)]
    
    plt.legend(data[target].unique(), title = target)
    
    return plt.show()
In [88]:
#OTHER
feature_target_plot(data, 'OTHER', 'ICU')
In [89]:
#IMMUNOCOMPROMISED
feature_target_plot(data, 'IMMUNOCOMPROMISED', 'ICU')
In [90]:
#DISEASE GROUPING 1
feature_target_plot(data, 'DISEASE GROUPING 1', 'ICU')
In [91]:
#DISEASE GROUPING 2
feature_target_plot(data, 'DISEASE GROUPING 2', 'ICU')
In [92]:
#DISEASE GROUPING 3
feature_target_plot(data, 'DISEASE GROUPING 3', 'ICU')
In [93]:
#DISEASE GROUPING 4
feature_target_plot(data, 'DISEASE GROUPING 4', 'ICU')
In [94]:
#DISEASE GROUPING 5
feature_target_plot(data, 'DISEASE GROUPING 5', 'ICU')

Observation: These straightforward plots reveal some extremely intriguing things. The two that stand out most are:

  • When the value varies over time for the DISEASE GROUPING 1 through 4 features, the patient tends not to be admitted to the intensive care unit.
  • The OTHER attribute, which shows the largest change over time, exhibits a significant shift in the likelihood of ICU admission.

The major takeaway is that we should not ignore the time component of these features. Additionally, we discovered that the evolution of these features over time can serve as a brand-new engineered attribute for our model.
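Such an engineered attribute can be built with the same max-minus-min logic used in `find_feature_value_change` above: flag 1 when a patient's value for the group changes across windows, 0 otherwise. A hedged sketch on synthetic data (the toy frame and the `OTHER_CHANGED` name are illustrative):

```python
import pandas as pd

# One row per patient; suffixed columns hold the per-window values (illustrative)
df = pd.DataFrame({
    'PATIENT_VISIT_IDENTIFIER_1': [0, 1, 2],
    'OTHER_1': [0, 1, 0],
    'OTHER_2': [1, 1, 0],
})

group_cols = ['OTHER_1', 'OTHER_2']
# 1 if the value changed at any point across the windows, 0 otherwise
df['OTHER_CHANGED'] = (df[group_cols].max(axis=1)
                       != df[group_cols].min(axis=1)).astype(int)
print(df['OTHER_CHANGED'].tolist())  # [1, 0, 0]
```

For binary group features this max/min comparison catches any flip between 0 and 1 over the observed windows.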

In [95]:
data.shape
Out[95]:
(352, 226)
In [96]:
#Remove all data taken after window 1
model_1_cols = [x for x in data.columns if x[-1] not in [str(y) for y in range(2,6)]]
data_1 = data[model_1_cols]

data_1.head()
Out[96]:
AGE_ABOVE65_1 AGE_PERCENTIL_1 GENDER_1 HTN_1 PATIENT_VISIT_IDENTIFIER_1 BLOODPRESSURE_DIASTOLIC_DIFF_1 BLOODPRESSURE_DIASTOLIC_DIFF_REL_1 BLOODPRESSURE_DIASTOLIC_MAX_1 BLOODPRESSURE_DIASTOLIC_MEAN_1 BLOODPRESSURE_DIASTOLIC_MEDIAN_1 BLOODPRESSURE_DIASTOLIC_MIN_1 BLOODPRESSURE_SISTOLIC_DIFF_1 BLOODPRESSURE_SISTOLIC_DIFF_REL_1 BLOODPRESSURE_SISTOLIC_MAX_1 BLOODPRESSURE_SISTOLIC_MEAN_1 BLOODPRESSURE_SISTOLIC_MEDIAN_1 BLOODPRESSURE_SISTOLIC_MIN_1 DISEASE GROUPING 1_1 DISEASE GROUPING 2_1 DISEASE GROUPING 3_1 DISEASE GROUPING 4_1 DISEASE GROUPING 5_1 DISEASE GROUPING 6_1 HEART_RATE_DIFF_1 HEART_RATE_DIFF_REL_1 HEART_RATE_MAX_1 HEART_RATE_MEAN_1 HEART_RATE_MEDIAN_1 HEART_RATE_MIN_1 IMMUNOCOMPROMISED_1 OTHER_1 OXYGEN_SATURATION_DIFF_1 OXYGEN_SATURATION_DIFF_REL_1 OXYGEN_SATURATION_MAX_1 OXYGEN_SATURATION_MEAN_1 OXYGEN_SATURATION_MEDIAN_1 OXYGEN_SATURATION_MIN_1 RESPIRATORY_RATE_DIFF_1 RESPIRATORY_RATE_DIFF_REL_1 RESPIRATORY_RATE_MAX_1 RESPIRATORY_RATE_MEAN_1 RESPIRATORY_RATE_MEDIAN_1 RESPIRATORY_RATE_MIN_1 TEMPERATURE_DIFF_1 TEMPERATURE_DIFF_REL_1 TEMPERATURE_MAX_1 TEMPERATURE_MEAN_1 TEMPERATURE_MEDIAN_1 TEMPERATURE_MIN_1 ICU
0 1 60th 0 0.0 0 -1.000000 -1.000000 -0.247863 0.086420 0.086420 0.237113 -1.000000 -1.000000 -0.459459 -0.230769 -0.230769 0.0000 0.0 0.0 0.0 0.0 1.0 1.0 -1.000000 -1.000000 -0.432836 -0.283019 -0.283019 -0.162393 0.0 1.0 -1.000000 -1.000000 0.736842 0.736842 0.736842 0.898990 -1.000000 -1.000000 -0.636364 -0.593220 -0.586207 -0.500000 -1.000000 -1.000000 -0.420290 -0.285714 -0.285714 0.208791 1
2 0 10th 0 0.0 2 -0.547826 -0.515528 -0.435897 -0.489712 -0.506173 -0.525773 -0.533742 -0.351328 -0.491892 -0.685470 -0.815385 -0.5125 0.0 0.0 0.0 0.0 0.0 0.0 -0.603053 -0.747001 0.000000 -0.048218 -0.056604 -0.111111 0.0 1.0 -0.959596 -0.961262 1.000000 0.935673 0.947368 0.959596 -0.764706 -0.756272 -0.575758 -0.645951 -0.517241 -0.714286 -1.000000 -1.000000 0.101449 0.357143 0.357143 0.604396 1
3 0 40th 1 0.0 3 -1.000000 -1.000000 -0.299145 0.012346 0.012346 0.175258 -1.000000 -1.000000 -0.556757 -0.369231 -0.369231 -0.1125 0.0 0.0 0.0 0.0 0.0 0.0 -1.000000 -1.000000 -0.626866 -0.528302 -0.528302 -0.384615 1.0 1.0 -1.000000 -1.000000 0.684211 0.684211 0.684211 0.878788 -1.000000 -1.000000 -0.515152 -0.457627 -0.448276 -0.357143 -1.000000 -1.000000 -0.420290 -0.285714 -0.285714 0.208791 0
4 0 10th 0 0.0 4 -1.000000 -1.000000 -0.076923 0.333333 0.333333 0.443299 -0.877301 -0.883669 -0.351351 -0.153846 -0.153846 0.0000 0.0 0.0 0.0 0.0 0.0 0.0 -0.923664 -0.956805 -0.044776 0.160377 0.160377 0.196581 0.0 1.0 -0.979798 -0.980333 0.894737 0.868421 0.868421 0.939394 -0.882353 -0.870968 -0.575758 -0.593220 -0.586207 -0.571429 -0.952381 -0.953536 0.072464 0.285714 0.285714 0.538462 0
5 0 10th 0 0.0 5 -0.826087 -0.860870 -0.247863 -0.037037 -0.037037 0.030928 -0.754601 -0.714460 -0.567568 -0.538462 -0.538462 -0.3750 0.0 0.0 0.0 0.0 0.0 0.0 -0.984733 -0.986481 -0.626866 -0.537736 -0.537736 -0.401709 0.0 1.0 -0.979798 -0.980129 0.842105 0.815789 0.815789 0.919192 -1.000000 -1.000000 -0.575758 -0.525424 -0.517241 -0.428571 -0.976190 -0.975891 -0.333333 -0.196429 -0.196429 0.252747 0
In [97]:
data_1.shape
Out[97]:
(352, 50)

Observation:

We have kept only the features from the first window (0-2 hours) and removed the columns for windows 2 through 5. Hence, we are left with 352 rows and 50 columns.

In [98]:
data_1.columns
Out[98]:
Index(['AGE_ABOVE65_1', 'AGE_PERCENTIL_1', 'GENDER_1', 'HTN_1',
       'PATIENT_VISIT_IDENTIFIER_1', 'BLOODPRESSURE_DIASTOLIC_DIFF_1',
       'BLOODPRESSURE_DIASTOLIC_DIFF_REL_1', 'BLOODPRESSURE_DIASTOLIC_MAX_1',
       'BLOODPRESSURE_DIASTOLIC_MEAN_1', 'BLOODPRESSURE_DIASTOLIC_MEDIAN_1',
       'BLOODPRESSURE_DIASTOLIC_MIN_1', 'BLOODPRESSURE_SISTOLIC_DIFF_1',
       'BLOODPRESSURE_SISTOLIC_DIFF_REL_1', 'BLOODPRESSURE_SISTOLIC_MAX_1',
       'BLOODPRESSURE_SISTOLIC_MEAN_1', 'BLOODPRESSURE_SISTOLIC_MEDIAN_1',
       'BLOODPRESSURE_SISTOLIC_MIN_1', 'DISEASE GROUPING 1_1',
       'DISEASE GROUPING 2_1', 'DISEASE GROUPING 3_1', 'DISEASE GROUPING 4_1',
       'DISEASE GROUPING 5_1', 'DISEASE GROUPING 6_1', 'HEART_RATE_DIFF_1',
       'HEART_RATE_DIFF_REL_1', 'HEART_RATE_MAX_1', 'HEART_RATE_MEAN_1',
       'HEART_RATE_MEDIAN_1', 'HEART_RATE_MIN_1', 'IMMUNOCOMPROMISED_1',
       'OTHER_1', 'OXYGEN_SATURATION_DIFF_1', 'OXYGEN_SATURATION_DIFF_REL_1',
       'OXYGEN_SATURATION_MAX_1', 'OXYGEN_SATURATION_MEAN_1',
       'OXYGEN_SATURATION_MEDIAN_1', 'OXYGEN_SATURATION_MIN_1',
       'RESPIRATORY_RATE_DIFF_1', 'RESPIRATORY_RATE_DIFF_REL_1',
       'RESPIRATORY_RATE_MAX_1', 'RESPIRATORY_RATE_MEAN_1',
       'RESPIRATORY_RATE_MEDIAN_1', 'RESPIRATORY_RATE_MIN_1',
       'TEMPERATURE_DIFF_1', 'TEMPERATURE_DIFF_REL_1', 'TEMPERATURE_MAX_1',
       'TEMPERATURE_MEAN_1', 'TEMPERATURE_MEDIAN_1', 'TEMPERATURE_MIN_1',
       'ICU'],
      dtype='object')

At this point, most data cleaning has already been performed. We have also looked at the features individually. Let's now focus on investigating the relationships between features and the target. First, we are going to look at how they correlate to each other.

Correlations¶

In [99]:
#Compute Pearson correlation
data_1_corr = data_1.corr()

plt.figure(figsize = (10, 8))
sns.heatmap(data_1_corr)
plt.show()

This graphical approach is clearly not very helpful for this many features, so let's split our analysis in two. First, we will look at how the features relate to each other, excluding the target column.

In [100]:
#Show correlation values in stacked format
def rank_correlation_score(data):
    
    #Stack correlation map into 3-columns format
    stacked_corr = data.corr().stack().reset_index().rename(
       columns = {'level_0': 'Feature_1',
                  'level_1': 'Feature_2',
                  0: 'Pearson_Corr'})
    
    #Remove redundant relationships
    stacked_corr = stacked_corr.query('Feature_1 != Feature_2')
    chained_feature_names = ['-'.join(np.sort(x)) for x in stacked_corr[['Feature_1', 'Feature_2']].values]
    stacked_corr.loc[:,'Duplicate_Key'] = chained_feature_names
    stacked_corr = stacked_corr.drop_duplicates(subset = 'Duplicate_Key').drop(columns = 'Duplicate_Key')

    #Remove correlations to the target
    stacked_corr = stacked_corr[stacked_corr['Feature_1'] != 'ICU']
    stacked_corr = stacked_corr[stacked_corr['Feature_2'] != 'ICU']
    
    #Order by absolute correlation strength
    stacked_corr['Pearson_Corr'] = abs(stacked_corr['Pearson_Corr'])
    return stacked_corr.sort_values(by = 'Pearson_Corr', ascending = False)

stacked_data_1_corr = rank_correlation_score(data_1)
stacked_data_1_corr
Out[100]:
Feature_1 Feature_2 Pearson_Corr
2101 TEMPERATURE_DIFF_1 TEMPERATURE_DIFF_REL_1 0.999444
1501 OXYGEN_SATURATION_DIFF_1 OXYGEN_SATURATION_DIFF_REL_1 0.998889
651 BLOODPRESSURE_SISTOLIC_MEAN_1 BLOODPRESSURE_SISTOLIC_MEDIAN_1 0.996952
1251 HEART_RATE_MEAN_1 HEART_RATE_MEDIAN_1 0.993374
2251 TEMPERATURE_MEAN_1 TEMPERATURE_MEDIAN_1 0.993236
... ... ... ...
733 BLOODPRESSURE_SISTOLIC_MEDIAN_1 TEMPERATURE_MIN_1 0.000809
924 DISEASE GROUPING 3_1 TEMPERATURE_DIFF_1 0.000746
516 BLOODPRESSURE_SISTOLIC_DIFF_1 HEART_RATE_MEDIAN_1 0.000576
194 PATIENT_VISIT_IDENTIFIER_1 TEMPERATURE_MIN_1 0.000536
1070 DISEASE GROUPING 6_1 RESPIRATORY_RATE_MIN_1 0.000049

1128 rows × 3 columns

Observation:

  • The correlation matrix has been converted to stacked format. Some feature pairs show quite high correlation coefficients. Let's take a closer look at the ones where Pearson_Corr is larger than 0.99.
In [101]:
#Filter very strong correlations
stacked_data_1_corr[stacked_data_1_corr['Pearson_Corr'] > 0.99]
Out[101]:
Feature_1 Feature_2 Pearson_Corr
2101 TEMPERATURE_DIFF_1 TEMPERATURE_DIFF_REL_1 0.999444
1501 OXYGEN_SATURATION_DIFF_1 OXYGEN_SATURATION_DIFF_REL_1 0.998889
651 BLOODPRESSURE_SISTOLIC_MEAN_1 BLOODPRESSURE_SISTOLIC_MEDIAN_1 0.996952
1251 HEART_RATE_MEAN_1 HEART_RATE_MEDIAN_1 0.993374
2251 TEMPERATURE_MEAN_1 TEMPERATURE_MEDIAN_1 0.993236
351 BLOODPRESSURE_DIASTOLIC_MEAN_1 BLOODPRESSURE_DIASTOLIC_MEDIAN_1 0.992260
1651 OXYGEN_SATURATION_MEAN_1 OXYGEN_SATURATION_MEDIAN_1 0.990731
501 BLOODPRESSURE_SISTOLIC_DIFF_1 BLOODPRESSURE_SISTOLIC_DIFF_REL_1 0.990595

Observation: What we see here is that two measurement combinations tend to present a strong correlation with each other:

  • MEAN/MEDIAN.
  • DIFF/DIFF_REL

It is not exactly surprising to see this behaviour. Still, it is better to have proof than to base our feature selection on assumptions. We will dig deeper and check whether this assessment holds in every case for our dataset.

In [102]:
#Investigate MEAN/MEDIAN correlations
stacked_data_1_corr['MEASURE_FEATURE_1'] = [x.split('_')[0] for x in stacked_data_1_corr['Feature_1']]
stacked_data_1_corr['MEASURE_FEATURE_2'] = [x.split('_')[0] for x in stacked_data_1_corr['Feature_2']]
stacked_data_1_corr['TYPE_FEATURE_1'] = [x.split('_')[-2] for x in stacked_data_1_corr['Feature_1']]
stacked_data_1_corr['TYPE_FEATURE_2'] = [x.split('_')[-2] for x in stacked_data_1_corr['Feature_2']]

mean_median_corr = stacked_data_1_corr.query('MEASURE_FEATURE_1 == MEASURE_FEATURE_2')
mean_median_corr = mean_median_corr.query('TYPE_FEATURE_1 != TYPE_FEATURE_2')
mean_median_corr = mean_median_corr[mean_median_corr['TYPE_FEATURE_1'].isin(['MEDIAN', 'MEAN'])]
mean_median_corr = mean_median_corr[mean_median_corr['TYPE_FEATURE_2'].isin(['MEDIAN', 'MEAN'])]

relevant_cols = ['Feature_1', 'Feature_2', 'Pearson_Corr']
mean_median_corr[relevant_cols]
Out[102]:
Feature_1 Feature_2 Pearson_Corr
651 BLOODPRESSURE_SISTOLIC_MEAN_1 BLOODPRESSURE_SISTOLIC_MEDIAN_1 0.996952
1251 HEART_RATE_MEAN_1 HEART_RATE_MEDIAN_1 0.993374
2251 TEMPERATURE_MEAN_1 TEMPERATURE_MEDIAN_1 0.993236
351 BLOODPRESSURE_DIASTOLIC_MEAN_1 BLOODPRESSURE_DIASTOLIC_MEDIAN_1 0.992260
1651 OXYGEN_SATURATION_MEAN_1 OXYGEN_SATURATION_MEDIAN_1 0.990731
1951 RESPIRATORY_RATE_MEAN_1 RESPIRATORY_RATE_MEDIAN_1 0.989124
357 BLOODPRESSURE_DIASTOLIC_MEAN_1 BLOODPRESSURE_SISTOLIC_MEDIAN_1 0.560540
405 BLOODPRESSURE_DIASTOLIC_MEDIAN_1 BLOODPRESSURE_SISTOLIC_MEAN_1 0.539823

Well, the table shows there are no exceptions to our previous assertion. Still, there are two smaller correlation values above. If you look closer, however, in those rows Feature_1 and Feature_2 do not belong to the same measurement. They just slipped through our filters and do not matter for this specific step.

In the end, this means we do not need both MEAN and MEDIAN features for a given measurement type. So, we are going to stick with the MEAN attributes for now.
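As a side note, this pair-dropping logic can be generalised. The sketch below is a hypothetical helper (the function name and threshold are our own, not part of this notebook's pipeline) that lists one column from each highly correlated pair of a DataFrame:

```python
import numpy as np
import pandas as pd

def drop_redundant(df, threshold=0.99):
    """Return column names to drop so that no remaining pair of numeric
    columns has a Pearson correlation above `threshold`.

    For each highly correlated pair, the column appearing later in the
    DataFrame is marked for removal (an arbitrary but deterministic choice).
    """
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Toy example: 'b' duplicates 'a' up to scaling, 'c' is independent
rng = np.random.default_rng(1)
toy = pd.DataFrame({'a': rng.normal(size=200)})
toy['b'] = 2 * toy['a'] + 0.001 * rng.normal(size=200)
toy['c'] = rng.normal(size=200)
print(drop_redundant(toy))   # → ['b']
```

In this notebook we keep the manual column lists instead, since the choice of which member of each pair to keep (MEAN over MEDIAN, DIFF over DIFF_REL) is deliberate rather than positional.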

In [103]:
#List columns to be removed
cols_to_remove = ['BLOODPRESSURE_DIASTOLIC_MEDIAN_1', 'BLOODPRESSURE_SISTOLIC_MEDIAN_1', 'HEART_RATE_MEDIAN_1',
                  'OXYGEN_SATURATION_MEDIAN_1', 'RESPIRATORY_RATE_MEDIAN_1', 'TEMPERATURE_MEDIAN_1']
In [104]:
#Investigate DIFF/DIFF_REL correlations
diff_corr = stacked_data_1_corr.query('MEASURE_FEATURE_1 == MEASURE_FEATURE_2')
diff_corr = diff_corr.query('TYPE_FEATURE_1 != TYPE_FEATURE_2')
diff_corr = diff_corr[diff_corr['TYPE_FEATURE_1'].isin(['DIFF', 'REL'])]
diff_corr = diff_corr[diff_corr['TYPE_FEATURE_2'].isin(['DIFF', 'REL'])]

diff_corr[relevant_cols]
Out[104]:
Feature_1 Feature_2 Pearson_Corr
2101 TEMPERATURE_DIFF_1 TEMPERATURE_DIFF_REL_1 0.999444
1501 OXYGEN_SATURATION_DIFF_1 OXYGEN_SATURATION_DIFF_REL_1 0.998889
501 BLOODPRESSURE_SISTOLIC_DIFF_1 BLOODPRESSURE_SISTOLIC_DIFF_REL_1 0.990595
201 BLOODPRESSURE_DIASTOLIC_DIFF_1 BLOODPRESSURE_DIASTOLIC_DIFF_REL_1 0.978508
1801 RESPIRATORY_RATE_DIFF_1 RESPIRATORY_RATE_DIFF_REL_1 0.961739
1101 HEART_RATE_DIFF_1 HEART_RATE_DIFF_REL_1 0.949914
207 BLOODPRESSURE_DIASTOLIC_DIFF_1 BLOODPRESSURE_SISTOLIC_DIFF_REL_1 0.799433
255 BLOODPRESSURE_DIASTOLIC_DIFF_REL_1 BLOODPRESSURE_SISTOLIC_DIFF_1 0.777548

Once again, no exceptions are found for the DIFF/DIFF_REL correlation behavior. As in the previous table, the last two records can be neglected, since they do not actually compare the same measurement type. For this pair, we choose the DIFF_REL features to remove.

In [105]:
#Add columns to the remove list
cols_to_remove.extend(['BLOODPRESSURE_DIASTOLIC_DIFF_REL_1', 'BLOODPRESSURE_SISTOLIC_DIFF_REL_1', 
                       'HEART_RATE_DIFF_REL_1', 'OXYGEN_SATURATION_DIFF_REL_1', 'RESPIRATORY_RATE_DIFF_REL_1',
                       'TEMPERATURE_DIFF_REL_1'])

Now that we have already evaluated the most extreme correlation cases between features, let's see how these attributes relate to the target column.

In [106]:
#Sort absolute correlations values to the target
data_1 = data_1.drop(columns = cols_to_remove)
data_1_target_corr = abs(data_1.corr()['ICU'])

data_1_target_corr[data_1_target_corr < 1].sort_values(ascending = False)
Out[106]:
AGE_ABOVE65_1                     0.291010
RESPIRATORY_RATE_MAX_1            0.213938
RESPIRATORY_RATE_MEAN_1           0.207911
BLOODPRESSURE_DIASTOLIC_MEAN_1    0.201210
BLOODPRESSURE_DIASTOLIC_MIN_1     0.195703
HTN_1                             0.180555
RESPIRATORY_RATE_MIN_1            0.173043
BLOODPRESSURE_DIASTOLIC_MAX_1     0.166832
OXYGEN_SATURATION_MEAN_1          0.147612
OXYGEN_SATURATION_MIN_1           0.139034
OXYGEN_SATURATION_MAX_1           0.131615
DISEASE GROUPING 3_1              0.122514
DISEASE GROUPING 5_1              0.122200
GENDER_1                          0.117938
DISEASE GROUPING 4_1              0.112573
BLOODPRESSURE_SISTOLIC_MAX_1      0.109073
BLOODPRESSURE_SISTOLIC_DIFF_1     0.107106
RESPIRATORY_RATE_DIFF_1           0.093877
DISEASE GROUPING 2_1              0.087753
TEMPERATURE_MEAN_1                0.086764
TEMPERATURE_MIN_1                 0.086575
BLOODPRESSURE_SISTOLIC_MEAN_1     0.084371
TEMPERATURE_MAX_1                 0.079548
DISEASE GROUPING 1_1              0.071825
IMMUNOCOMPROMISED_1               0.071221
BLOODPRESSURE_DIASTOLIC_DIFF_1    0.065228
BLOODPRESSURE_SISTOLIC_MIN_1      0.058086
OTHER_1                           0.050656
HEART_RATE_MEAN_1                 0.048263
HEART_RATE_MAX_1                  0.047453
HEART_RATE_MIN_1                  0.042645
PATIENT_VISIT_IDENTIFIER_1        0.041382
DISEASE GROUPING 6_1              0.026684
OXYGEN_SATURATION_DIFF_1          0.020897
HEART_RATE_DIFF_1                 0.013554
TEMPERATURE_DIFF_1                0.006336
Name: ICU, dtype: float64

This table does not tell us a lot on its own, but there are some observations we can make. First, the most strongly correlated feature, AGE_ABOVE65_1, confirms that the elderly are more affected by the disease. Second, the DIFF attributes mostly sit in the weaker-correlated half of our features, which is not a surprise considering how poorly distributed they are.

Feature Encoding¶

Examining our data, there is still one feature that needs additional processing: the categorical AGE_PERCENTIL_1. Let's start the encoding process.

In [107]:
data_1.head()
Out[107]:
AGE_ABOVE65_1 AGE_PERCENTIL_1 GENDER_1 HTN_1 PATIENT_VISIT_IDENTIFIER_1 BLOODPRESSURE_DIASTOLIC_DIFF_1 BLOODPRESSURE_DIASTOLIC_MAX_1 BLOODPRESSURE_DIASTOLIC_MEAN_1 BLOODPRESSURE_DIASTOLIC_MIN_1 BLOODPRESSURE_SISTOLIC_DIFF_1 BLOODPRESSURE_SISTOLIC_MAX_1 BLOODPRESSURE_SISTOLIC_MEAN_1 BLOODPRESSURE_SISTOLIC_MIN_1 DISEASE GROUPING 1_1 DISEASE GROUPING 2_1 DISEASE GROUPING 3_1 DISEASE GROUPING 4_1 DISEASE GROUPING 5_1 DISEASE GROUPING 6_1 HEART_RATE_DIFF_1 HEART_RATE_MAX_1 HEART_RATE_MEAN_1 HEART_RATE_MIN_1 IMMUNOCOMPROMISED_1 OTHER_1 OXYGEN_SATURATION_DIFF_1 OXYGEN_SATURATION_MAX_1 OXYGEN_SATURATION_MEAN_1 OXYGEN_SATURATION_MIN_1 RESPIRATORY_RATE_DIFF_1 RESPIRATORY_RATE_MAX_1 RESPIRATORY_RATE_MEAN_1 RESPIRATORY_RATE_MIN_1 TEMPERATURE_DIFF_1 TEMPERATURE_MAX_1 TEMPERATURE_MEAN_1 TEMPERATURE_MIN_1 ICU
0 1 60th 0 0.0 0 -1.000000 -0.247863 0.086420 0.237113 -1.000000 -0.459459 -0.230769 0.0000 0.0 0.0 0.0 0.0 1.0 1.0 -1.000000 -0.432836 -0.283019 -0.162393 0.0 1.0 -1.000000 0.736842 0.736842 0.898990 -1.000000 -0.636364 -0.593220 -0.500000 -1.000000 -0.420290 -0.285714 0.208791 1
2 0 10th 0 0.0 2 -0.547826 -0.435897 -0.489712 -0.525773 -0.533742 -0.491892 -0.685470 -0.5125 0.0 0.0 0.0 0.0 0.0 0.0 -0.603053 0.000000 -0.048218 -0.111111 0.0 1.0 -0.959596 1.000000 0.935673 0.959596 -0.764706 -0.575758 -0.645951 -0.714286 -1.000000 0.101449 0.357143 0.604396 1
3 0 40th 1 0.0 3 -1.000000 -0.299145 0.012346 0.175258 -1.000000 -0.556757 -0.369231 -0.1125 0.0 0.0 0.0 0.0 0.0 0.0 -1.000000 -0.626866 -0.528302 -0.384615 1.0 1.0 -1.000000 0.684211 0.684211 0.878788 -1.000000 -0.515152 -0.457627 -0.357143 -1.000000 -0.420290 -0.285714 0.208791 0
4 0 10th 0 0.0 4 -1.000000 -0.076923 0.333333 0.443299 -0.877301 -0.351351 -0.153846 0.0000 0.0 0.0 0.0 0.0 0.0 0.0 -0.923664 -0.044776 0.160377 0.196581 0.0 1.0 -0.979798 0.894737 0.868421 0.939394 -0.882353 -0.575758 -0.593220 -0.571429 -0.952381 0.072464 0.285714 0.538462 0
5 0 10th 0 0.0 5 -0.826087 -0.247863 -0.037037 0.030928 -0.754601 -0.567568 -0.538462 -0.3750 0.0 0.0 0.0 0.0 0.0 0.0 -0.984733 -0.626866 -0.537736 -0.401709 0.0 1.0 -0.979798 0.842105 0.815789 0.919192 -1.000000 -0.575758 -0.525424 -0.428571 -0.976190 -0.333333 -0.196429 0.252747 0
In [108]:
#Define function to encode features
def encode_feature(data, col):
    new_cols = pd.get_dummies(data[col], prefix = col, prefix_sep = ':', drop_first = True)
    return pd.concat([data.drop(columns = col), new_cols], axis = 1)
In [109]:
#Encode AGE_PERCENTIL_1
data_1 = encode_feature(data_1, 'AGE_PERCENTIL_1')
In [110]:
data_1.columns
Out[110]:
Index(['AGE_ABOVE65_1', 'GENDER_1', 'HTN_1', 'PATIENT_VISIT_IDENTIFIER_1',
       'BLOODPRESSURE_DIASTOLIC_DIFF_1', 'BLOODPRESSURE_DIASTOLIC_MAX_1',
       'BLOODPRESSURE_DIASTOLIC_MEAN_1', 'BLOODPRESSURE_DIASTOLIC_MIN_1',
       'BLOODPRESSURE_SISTOLIC_DIFF_1', 'BLOODPRESSURE_SISTOLIC_MAX_1',
       'BLOODPRESSURE_SISTOLIC_MEAN_1', 'BLOODPRESSURE_SISTOLIC_MIN_1',
       'DISEASE GROUPING 1_1', 'DISEASE GROUPING 2_1', 'DISEASE GROUPING 3_1',
       'DISEASE GROUPING 4_1', 'DISEASE GROUPING 5_1', 'DISEASE GROUPING 6_1',
       'HEART_RATE_DIFF_1', 'HEART_RATE_MAX_1', 'HEART_RATE_MEAN_1',
       'HEART_RATE_MIN_1', 'IMMUNOCOMPROMISED_1', 'OTHER_1',
       'OXYGEN_SATURATION_DIFF_1', 'OXYGEN_SATURATION_MAX_1',
       'OXYGEN_SATURATION_MEAN_1', 'OXYGEN_SATURATION_MIN_1',
       'RESPIRATORY_RATE_DIFF_1', 'RESPIRATORY_RATE_MAX_1',
       'RESPIRATORY_RATE_MEAN_1', 'RESPIRATORY_RATE_MIN_1',
       'TEMPERATURE_DIFF_1', 'TEMPERATURE_MAX_1', 'TEMPERATURE_MEAN_1',
       'TEMPERATURE_MIN_1', 'ICU', 'AGE_PERCENTIL_1:20th',
       'AGE_PERCENTIL_1:30th', 'AGE_PERCENTIL_1:40th', 'AGE_PERCENTIL_1:50th',
       'AGE_PERCENTIL_1:60th', 'AGE_PERCENTIL_1:70th', 'AGE_PERCENTIL_1:80th',
       'AGE_PERCENTIL_1:90th', 'AGE_PERCENTIL_1:Above 90th'],
      dtype='object')
In [111]:
data_1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 352 entries, 0 to 384
Data columns (total 46 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   AGE_ABOVE65_1                   352 non-null    int64  
 1   GENDER_1                        352 non-null    int64  
 2   HTN_1                           352 non-null    float64
 3   PATIENT_VISIT_IDENTIFIER_1      352 non-null    int64  
 4   BLOODPRESSURE_DIASTOLIC_DIFF_1  352 non-null    float64
 5   BLOODPRESSURE_DIASTOLIC_MAX_1   352 non-null    float64
 6   BLOODPRESSURE_DIASTOLIC_MEAN_1  352 non-null    float64
 7   BLOODPRESSURE_DIASTOLIC_MIN_1   352 non-null    float64
 8   BLOODPRESSURE_SISTOLIC_DIFF_1   352 non-null    float64
 9   BLOODPRESSURE_SISTOLIC_MAX_1    352 non-null    float64
 10  BLOODPRESSURE_SISTOLIC_MEAN_1   352 non-null    float64
 11  BLOODPRESSURE_SISTOLIC_MIN_1    352 non-null    float64
 12  DISEASE GROUPING 1_1            352 non-null    float64
 13  DISEASE GROUPING 2_1            352 non-null    float64
 14  DISEASE GROUPING 3_1            352 non-null    float64
 15  DISEASE GROUPING 4_1            352 non-null    float64
 16  DISEASE GROUPING 5_1            352 non-null    float64
 17  DISEASE GROUPING 6_1            352 non-null    float64
 18  HEART_RATE_DIFF_1               352 non-null    float64
 19  HEART_RATE_MAX_1                352 non-null    float64
 20  HEART_RATE_MEAN_1               352 non-null    float64
 21  HEART_RATE_MIN_1                352 non-null    float64
 22  IMMUNOCOMPROMISED_1             352 non-null    float64
 23  OTHER_1                         352 non-null    float64
 24  OXYGEN_SATURATION_DIFF_1        352 non-null    float64
 25  OXYGEN_SATURATION_MAX_1         352 non-null    float64
 26  OXYGEN_SATURATION_MEAN_1        352 non-null    float64
 27  OXYGEN_SATURATION_MIN_1         352 non-null    float64
 28  RESPIRATORY_RATE_DIFF_1         352 non-null    float64
 29  RESPIRATORY_RATE_MAX_1          352 non-null    float64
 30  RESPIRATORY_RATE_MEAN_1         352 non-null    float64
 31  RESPIRATORY_RATE_MIN_1          352 non-null    float64
 32  TEMPERATURE_DIFF_1              352 non-null    float64
 33  TEMPERATURE_MAX_1               352 non-null    float64
 34  TEMPERATURE_MEAN_1              352 non-null    float64
 35  TEMPERATURE_MIN_1               352 non-null    float64
 36  ICU                             352 non-null    int64  
 37  AGE_PERCENTIL_1:20th            352 non-null    uint8  
 38  AGE_PERCENTIL_1:30th            352 non-null    uint8  
 39  AGE_PERCENTIL_1:40th            352 non-null    uint8  
 40  AGE_PERCENTIL_1:50th            352 non-null    uint8  
 41  AGE_PERCENTIL_1:60th            352 non-null    uint8  
 42  AGE_PERCENTIL_1:70th            352 non-null    uint8  
 43  AGE_PERCENTIL_1:80th            352 non-null    uint8  
 44  AGE_PERCENTIL_1:90th            352 non-null    uint8  
 45  AGE_PERCENTIL_1:Above 90th      352 non-null    uint8  
dtypes: float64(33), int64(4), uint8(9)
memory usage: 107.6 KB

Observation:

  • The dataset now consists only of numeric columns (float64, int64, and the uint8 dummy variables), so it is ready for modelling.

Modelling and Model Evaluation ¶

We have already done some feature selection, so it is advisable to test how a model performs with our current data before moving on. Because our dataset is not very big and not many features are left, this can serve as a baseline. We can then assess whether eliminating or re-engineering features maintains or even improves on this initial attempt.

Let's try 8 different algorithms in this first effort. The selection is partly motivated by the hope that at least one of the models will be reasonably simple to interpret.

In [112]:
#Split data into train/test and validation
np.random.seed(10)

target_col = 'ICU'
feature_cols = data_1.drop(columns = ['ICU', 'PATIENT_VISIT_IDENTIFIER_1']).columns.values

x_train, x_validation, y_train, y_validation = train_test_split(data_1[feature_cols], data_1[target_col],
                                                                test_size = 0.1)
In [113]:
#Define function to test algorithm
def score_model(estimator, train_data, validation_data, cv):
    #Unpack data
    x_train, y_train = train_data
    x_validation, y_validation = validation_data
    
    #Perform cross-validation on train data
    model_cv = cross_validate(estimator = estimator, X = x_train, y = y_train,
                              scoring = ['accuracy', 'roc_auc'],
                              cv = cv)
    
    #Apply model to validation data
    estimator.fit(x_train, y_train)
    y_pred = estimator.predict(x_validation)

    #Print results
    print('CV model accuracy:  %.3f +/- %.3f'  %(model_cv['test_accuracy'].mean(), 
                                              model_cv['test_accuracy'].std()))
    print('CV model roc_auc:  %.3f +/- %.3f'  %(model_cv['test_roc_auc'].mean(), 
                                             model_cv['test_roc_auc'].std()))
    print('Validation accuracy score: %.3f' %accuracy_score(y_validation, y_pred))
    print('Validation ROC_AUC score: %.3f' %roc_auc_score(y_validation, y_pred))
    
    return estimator
In [114]:
clfs = {"LogisticRegression":LogisticRegression(), 
        "SVM":SVC(kernel='rbf', probability=True),
        "Decision":DecisionTreeClassifier(), 
        "RandomForest":RandomForestClassifier(), 
        "GradientBoost":GradientBoostingClassifier(),
        "XGBoost":XGBClassifier(verbosity=0), 
        "KNN":KNN(),
        "CatBoost":CatBoostClassifier(verbose=False)}
In [115]:
def model_fit(clfs):
    fitted_model={}
    model_result = pd.DataFrame()
    for model_name, model in clfs.items():
        model.fit(x_train,y_train)
        fitted_model.update({model_name:model})
        y_pred = model.predict(x_validation)
        model_dict = {}
        model_dict['1.Algorithm'] = model_name
        model_dict['2.Accuracy'] = round(accuracy_score(y_validation, y_pred),3)
        model_dict['3.Precision'] = round(precision_score(y_validation, y_pred),3)
        model_dict['4.Recall'] = round(recall_score(y_validation, y_pred),3)
        model_dict['5.F1'] = round(f1_score(y_validation, y_pred),3)
        model_dict['6.ROC'] = round(roc_auc_score(y_validation, y_pred),3)
        #DataFrame.append was removed in pandas 2.0; build rows via pd.concat instead
        model_result = pd.concat([model_result, pd.DataFrame([model_dict])], ignore_index=True)
    return fitted_model, model_result
In [116]:
fitted_model, model_result = model_fit(clfs)
In [117]:
model_result.sort_values(by=['2.Accuracy'],ascending=False)
Out[117]:
1.Algorithm 2.Accuracy 3.Precision 4.Recall 5.F1 6.ROC
3 RandomForest 0.750 0.733 0.688 0.710 0.744
4 GradientBoost 0.750 0.769 0.625 0.690 0.737
0 LogisticRegression 0.722 0.800 0.500 0.615 0.700
5 XGBoost 0.722 0.688 0.688 0.688 0.719
7 CatBoost 0.722 0.714 0.625 0.667 0.712
6 KNN 0.694 0.692 0.562 0.621 0.681
1 SVM 0.611 0.583 0.438 0.500 0.594
2 Decision 0.583 0.522 0.750 0.615 0.600
In [118]:
model_result["1.Algorithm"][2:]
Out[118]:
2         Decision
3     RandomForest
4    GradientBoost
5          XGBoost
6              KNN
7         CatBoost
Name: 1.Algorithm, dtype: object
In [119]:
model_result.sort_values(by=['2.Accuracy'],ascending=False)
Out[119]:
1.Algorithm 2.Accuracy 3.Precision 4.Recall 5.F1 6.ROC
3 RandomForest 0.750 0.733 0.688 0.710 0.744
4 GradientBoost 0.750 0.769 0.625 0.690 0.737
0 LogisticRegression 0.722 0.800 0.500 0.615 0.700
5 XGBoost 0.722 0.688 0.688 0.688 0.719
7 CatBoost 0.722 0.714 0.625 0.667 0.712
6 KNN 0.694 0.692 0.562 0.621 0.681
1 SVM 0.611 0.583 0.438 0.500 0.594
2 Decision 0.583 0.522 0.750 0.615 0.600
In [120]:
#Order models from worst to best validation accuracy and assign each an
#exponentially increasing voting weight, so the strongest models dominate
model_ordered = []
weights = []
i = 1
for model_name in model_result['1.Algorithm'][
    index_natsorted(model_result['2.Accuracy'], reverse=False)]:
    model_ordered.append([model_name, clfs.get(model_name)])
    weights.append(math.exp(i))   #weight grows as e^i
    i += 0.8
In [121]:
plt.plot(weights)
plt.show()
In [122]:
weights
Out[122]:
[2.718281828459045,
 6.0496474644129465,
 13.463738035001692,
 29.964100047397025,
 66.68633104092515,
 148.4131591025766,
 330.2995599096486,
 735.0951892419727]
In [123]:
model_ordered
Out[123]:
[['Decision', DecisionTreeClassifier()],
 ['SVM', SVC(probability=True)],
 ['KNN', KNeighborsClassifier()],
 ['LogisticRegression', LogisticRegression()],
 ['XGBoost',
  XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
                colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
                early_stopping_rounds=None, enable_categorical=False,
                eval_metric=None, feature_types=None, gamma=0, gpu_id=-1,
                grow_policy='depthwise', importance_type=None,
                interaction_constraints='', learning_rate=0.300000012,
                max_bin=256, max_cat_threshold=64, max_cat_to_onehot=4,
                max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
                missing=nan, monotone_constraints='()', n_estimators=100,
                n_jobs=0, num_parallel_tree=1, predictor='auto', random_state=0, ...)],
 ['CatBoost', <catboost.core.CatBoostClassifier at 0x162af7e6400>],
 ['RandomForest', RandomForestClassifier()],
 ['GradientBoost', GradientBoostingClassifier()]]
In [124]:
vc = VotingClassifier(estimators=model_ordered, weights=weights)
In [125]:
clfs_new = clfs.copy()
In [126]:
clfs_new.update({"VotingClassifier":vc})
In [127]:
clfs_new
Out[127]:
{'LogisticRegression': LogisticRegression(),
 'SVM': SVC(probability=True),
 'Decision': DecisionTreeClassifier(),
 'RandomForest': RandomForestClassifier(),
 'GradientBoost': GradientBoostingClassifier(),
 'XGBoost': XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
               colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
               early_stopping_rounds=None, enable_categorical=False,
               eval_metric=None, feature_types=None, gamma=0, gpu_id=-1,
               grow_policy='depthwise', importance_type=None,
               interaction_constraints='', learning_rate=0.300000012,
               max_bin=256, max_cat_threshold=64, max_cat_to_onehot=4,
               max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
               missing=nan, monotone_constraints='()', n_estimators=100,
               n_jobs=0, num_parallel_tree=1, predictor='auto', random_state=0, ...),
 'KNN': KNeighborsClassifier(),
 'CatBoost': <catboost.core.CatBoostClassifier at 0x162af7e6400>,
 'VotingClassifier': VotingClassifier(estimators=[['Decision', DecisionTreeClassifier()],
                              ['SVM', SVC(probability=True)],
                              ['KNN', KNeighborsClassifier()],
                              ['LogisticRegression', LogisticRegression()],
                              ['XGBoost',
                               XGBClassifier(base_score=0.5, booster='gbtree',
                                             callbacks=None, colsample_bylevel=1,
                                             colsample_bynode=1,
                                             colsample_bytree=1,
                                             early_stopping_rounds=None,
                                             enable_categorical=Fa...
                                             predictor='auto', random_state=0, ...)],
                              ['CatBoost',
                               <catboost.core.CatBoostClassifier object at 0x00000162AF7E6400>],
                              ['RandomForest', RandomForestClassifier()],
                              ['GradientBoost', GradientBoostingClassifier()]],
                  weights=[2.718281828459045, 6.0496474644129465,
                           13.463738035001692, 29.964100047397025,
                           66.68633104092515, 148.4131591025766,
                           330.2995599096486, 735.0951892419727])}
In [128]:
fitted_model_new, model_result_new = model_fit(clfs_new)
In [129]:
model_result_new.sort_values(by=['2.Accuracy'],ascending=False)
Out[129]:
1.Algorithm 2.Accuracy 3.Precision 4.Recall 5.F1 6.ROC
4 GradientBoost 0.750 0.769 0.625 0.690 0.737
0 LogisticRegression 0.722 0.800 0.500 0.615 0.700
3 RandomForest 0.722 0.688 0.688 0.688 0.719
5 XGBoost 0.722 0.688 0.688 0.688 0.719
7 CatBoost 0.722 0.714 0.625 0.667 0.712
8 VotingClassifier 0.722 0.714 0.625 0.667 0.712
6 KNN 0.694 0.692 0.562 0.621 0.681
1 SVM 0.611 0.583 0.438 0.500 0.594
2 Decision 0.556 0.500 0.688 0.579 0.569
In [130]:
#Test RandomForestClassifier model
baseline_model_2 = RandomForestClassifier()
fitted_baseline_model_2 = score_model(estimator = baseline_model_2, 
                                      train_data = (x_train, y_train),
                                      validation_data = (x_validation, y_validation),
                                      cv = 10)
CV model accuracy:  0.661 +/- 0.060
CV model roc_auc:  0.730 +/- 0.077
Validation accuracy score: 0.722
Validation ROC_AUC score: 0.719

Observation: Some of the results are quite good, especially given that positive cases make up only about 46% of our target variable. A perfect model has a ROC-AUC score of 1, while a model that is no better than random guessing scores 0.5. Hence, algorithms such as KNN, DecisionTree, and SVM have comparatively low ROC-AUC scores.
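The endpoints of the ROC-AUC scale mentioned above can be verified on a toy example:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]

# Scores that rank every positive above every negative: AUC = 1.0
print(roc_auc_score(y_true, [0.1, 0.2, 0.8, 0.9]))  # 1.0

# Scores that get only half of the positive/negative pairs in the right
# order: AUC = 0.5, no better than random guessing
print(roc_auc_score(y_true, [0.9, 0.1, 0.8, 0.2]))  # 0.5
```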

Cross-validation has several advantages over other methods for evaluating the performance of a model. For example, it can provide a more accurate estimate of the model's performance because it uses more of the data for training and evaluation. It can also be used to tune hyperparameters, which are model-specific parameters that cannot be learned from the data.
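For illustration, here is a minimal, standalone example of stratified 10-fold cross-validation; synthetic data stands in for our x_train/y_train, and the estimator choice is arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data; the notebook's own split is x_train/y_train
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# 10-fold stratified CV: each fold preserves the class ratio, and every
# sample is used for validation exactly once
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         scoring='roc_auc', cv=cv)
print('%.3f +/- %.3f' % (scores.mean(), scores.std()))
```

Reporting the mean and standard deviation across folds, as `score_model` does above, gives a sense of both the expected performance and its stability.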

The results are significantly better for RandomForest. This is an excellent sign, because some of the models weren't significantly more accurate than picking patients at random to be admitted to the intensive care unit. There are various directions we can go from here:

  • Further perform feature selection on the dataset in order to remove or add features;
  • Tune the algorithm hyperparameters to see if we can achieve some accuracy improvement;
  • Discard this method and proceed to a time-relevant model.

The best course of action is to continue researching our window-1 dataset. We can start by taking a look at the Random Forest model's feature importances.

Feature Selection¶

In [131]:
#Plot feature importances for the Random Forest Model
feat_importances = pd.Series(data = fitted_baseline_model_2.feature_importances_,
                             index = feature_cols).sort_values()
feat_importances.plot(kind = 'barh', figsize = (12, 11))
Out[131]:
<AxesSubplot:>

Observations:

  1. RESPIRATORY_RATE_MEAN_1 is the most important feature in this classification model.
  2. After OXYGEN_SATURATION_MAX_1, there is a significant dip in how important the attributes are to the algorithm.

We can check the validity of this observation by removing the least significant features from our dataset and observing how the same algorithm responds.

In [132]:
#Test RandomForest model on reduced dataset
ncols_to_keep = int(0.8 * len(feat_importances))
reduced_feat_columns = feat_importances.nlargest(n = ncols_to_keep).index

reduced_x_train = x_train[reduced_feat_columns]
reduced_x_validation = x_validation[reduced_feat_columns]

fitted_baseline_model_3 = score_model(estimator = baseline_model_2, 
                                      train_data = (reduced_x_train, y_train),
                                      validation_data = (reduced_x_validation, y_validation),
                                      cv = 10)
CV model accuracy:  0.658 +/- 0.073
CV model roc_auc:  0.732 +/- 0.083
Validation accuracy score: 0.806
Validation ROC_AUC score: 0.800
In [133]:
ncols_to_keep
Out[133]:
35

The reduced dataset keeps 35 columns. It is difficult to draw firm inferences when comparing these results to the full-dataset model above: the cross-validation scores do not shift significantly, although the validation scores are clearly better.

Applying a recursive feature elimination (RFE) strategy makes sense if we still want to investigate decreasing the number of features in our dataset. Given the limited dataset, it shouldn't take too long, and we can use a smaller Random Forest (fewer estimators) to further reduce processing time. RFE selects the subset of features that are most important for making predictions.

In [134]:
#Get RFE feature ranking and compare to RandomForestClassifier feature importance
rfe_model = RandomForestClassifier(n_estimators = 10)
feature_selector = RFE(estimator = rfe_model, step = 1)
feature_selector.fit(x_train, y_train)
feature_ranking = pd.Series(data = feature_selector.ranking_, index = feature_cols).sort_values()

importance_scale = pd.concat([feature_ranking, feat_importances.rank(ascending = False)], axis = 1)
importance_scale = importance_scale.rename(columns = {0: 'RFE_ranking', 1: 'RFC_ranking'})
importance_scale.sort_values(by = ['RFE_ranking', 'RFC_ranking'])
Out[134]:
RFE_ranking RFC_ranking
RESPIRATORY_RATE_MEAN_1 1 1.0
RESPIRATORY_RATE_MAX_1 1 2.0
BLOODPRESSURE_DIASTOLIC_MEAN_1 1 3.0
BLOODPRESSURE_DIASTOLIC_MAX_1 1 4.0
BLOODPRESSURE_SISTOLIC_MAX_1 1 5.0
HEART_RATE_MIN_1 1 6.0
TEMPERATURE_MIN_1 1 7.0
RESPIRATORY_RATE_MIN_1 1 8.0
HEART_RATE_MAX_1 1 9.0
BLOODPRESSURE_DIASTOLIC_MIN_1 1 10.0
TEMPERATURE_MEAN_1 1 11.0
HEART_RATE_MEAN_1 1 12.0
BLOODPRESSURE_SISTOLIC_MEAN_1 1 13.0
AGE_ABOVE65_1 1 14.0
BLOODPRESSURE_SISTOLIC_MIN_1 1 15.0
OXYGEN_SATURATION_MEAN_1 1 16.0
TEMPERATURE_MAX_1 1 17.0
OXYGEN_SATURATION_MIN_1 1 18.0
OXYGEN_SATURATION_MAX_1 1 19.0
TEMPERATURE_DIFF_1 1 21.0
AGE_PERCENTIL_1:90th 1 22.0
AGE_PERCENTIL_1:80th 1 23.0
GENDER_1 2 31.0
BLOODPRESSURE_DIASTOLIC_DIFF_1 3 30.0
HTN_1 4 20.0
RESPIRATORY_RATE_DIFF_1 5 27.0
OXYGEN_SATURATION_DIFF_1 6 25.0
OTHER_1 7 26.0
AGE_PERCENTIL_1:20th 8 35.0
DISEASE GROUPING 5_1 9 39.0
HEART_RATE_DIFF_1 10 28.0
AGE_PERCENTIL_1:Above 90th 11 29.0
BLOODPRESSURE_SISTOLIC_DIFF_1 12 24.0
IMMUNOCOMPROMISED_1 13 37.0
DISEASE GROUPING 3_1 14 33.0
AGE_PERCENTIL_1:60th 15 36.0
DISEASE GROUPING 2_1 16 34.0
AGE_PERCENTIL_1:50th 17 32.0
DISEASE GROUPING 6_1 18 44.0
AGE_PERCENTIL_1:30th 19 38.0
AGE_PERCENTIL_1:40th 20 40.0
AGE_PERCENTIL_1:70th 21 41.0
DISEASE GROUPING 4_1 22 42.0
DISEASE GROUPING 1_1 23 43.0

Observation:

  • It is evident that both feature selection methodologies share certain similarities. This is not conclusive on its own, as our earlier findings on eliminating the least significant features were themselves inconclusive. Let's test the classification model's performance using only the better half of the features, as determined by the RFE algorithm.
In [135]:
#Test RandomForest model for RFE reduced dataset
cols_to_keep = importance_scale[importance_scale['RFE_ranking'] == 1].index
reduced_x_train = x_train[cols_to_keep]
reduced_x_validation = x_validation[cols_to_keep]

fitted_baseline_model_4 = score_model(estimator = baseline_model_2, 
                                      train_data = (reduced_x_train, y_train),
                                      validation_data = (reduced_x_validation, y_validation),
                                      cv = 10)
CV model accuracy:  0.677 +/- 0.061
CV model roc_auc:  0.727 +/- 0.082
Validation accuracy score: 0.750
Validation ROC_AUC score: 0.744
In [136]:
len(cols_to_keep)
Out[136]:
22

Observation:

  • This was performed on a dataset with 22 columns. The outcomes seem marginally superior to those of the whole-dataset model, but the margin is too small for us to make any bold claims.

  • We did not get definitive results from the two tests we ran. Additionally, since these attributes are derived from a small number of measurements, we are not dealing with a large number of features anyway. The last step is to determine how much accuracy can be gained by adjusting the hyperparameters.

Hyperparameters Tuning¶

The Random Forest method served as the foundation for our top model, so we will only spend time fine-tuning its hyperparameters. To shorten the time spent on this step, only a subset of the hyperparameters will be explored:

  • n_estimators
  • criterion
  • max_depth
  • max_features

We can easily conduct a grid search to find the ideal collection of hyperparameters because our dataset is small.

In [137]:
#Define hyperparameter space
hyper_space = {
    'n_estimators': [10, 100, 500],
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 10, None],
    'max_features': ['sqrt', 'log2', None]
}
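Before running the search, we can sanity-check its cost: the candidate count is the product of the option counts in the grid, and each candidate is fit once per CV fold. A quick standalone check (restating the grid above):

```python
from itertools import product

hyper_space = {
    'n_estimators': [10, 100, 500],
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 10, None],
    'max_features': ['sqrt', 'log2', None],
}

# Every combination of one value per hyperparameter is a candidate
n_candidates = len(list(product(*hyper_space.values())))
n_folds = 10
print(n_candidates, n_candidates * n_folds)   # → 72 720
```

This matches the "72 candidates, totalling 720 fits" reported by GridSearchCV below.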
In [138]:
#Perform hyperparameter tuning by grid searching the defined space
grid_search = GridSearchCV(estimator = baseline_model_2, 
                           param_grid = hyper_space,
                           scoring = 'roc_auc',
                           cv = 10,
                           n_jobs = 4,
                           verbose = 1)
grid_search_results = grid_search.fit(x_train, y_train)
Fitting 10 folds for each of 72 candidates, totalling 720 fits
In [140]:
#Look at the best performing set of hyperparameters and apply estimator on validation data
print(grid_search.best_params_)

best_gridsearch_model = grid_search.best_estimator_
best_gridsearch_model.fit(x_train, y_train)
y_pred = best_gridsearch_model.predict(x_validation)

print('Validation accuracy: %.3f' %(accuracy_score(y_validation, y_pred)))
print('Validation ROC_AUC: %.3f' %(roc_auc_score(y_validation, y_pred)))
{'criterion': 'gini', 'max_depth': 5, 'max_features': 'log2', 'n_estimators': 500}
Validation accuracy: 0.806
Validation ROC_AUC: 0.794

Observation: We were able to improve the validation accuracy by around 6 percentage points (from 0.750 to 0.806) just by adjusting the algorithm's hyperparameters. We have also reached a point where our model correctly forecasts whether a patient will need an ICU bed in more than 80% of cases.

CONCLUSION¶

We were able to develop prediction models in this notebook for the ICU admission classification problem. The work concentrated on the earliest data available for each patient, producing a reasonably accurate model. The model's ability to successfully classify patients of both target classes is one sign that the data processing stages were successful.

We must reiterate our caution that working with small datasets restricts how confident we can be in our findings.

Reference: COVID-19 - Clinical Data to assess diagnosis (2020). ICU Admission Data: Classification Model. Accessed 20/11/2022, from https://www.kaggle.com/code/epdrumond/icu-admission-data-classification-modelprediction/overview

Special thanks to EDILSON DRUMOND for assisting and providing further insight into his code via LinkedIn; his work served as a foundation for my assignment.